Generating Landslide Archive Inventories Using Web Scraping and NLP Techniques for Türkiye

Research Square | Research Article

Elnaz Najatishendi, Tolga Görüm, Seçkin Fidan, Fusun Balık Şanlı

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7463555/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 26 Dec, 2025 in Natural Hazards. Preprint version 1.

Abstract

Landslides are among the most frequent natural hazards that cause significant loss of life and serious economic damage worldwide. Although many inventories have been created using different approaches to understand landslide events, these are rarely updated automatically or in real time. Traditional approaches are laborious processes due to the time and intensive labor requirements, and are limited in terms of timeliness due to reporting delays.
To address these challenges, we developed an automated approach that integrates web scraping, natural language processing (NLP), and geocoding techniques using digital media news sources in Türkiye to create a landslide archive inventory. Our algorithm verified 1727 of the 3051 news articles it captured between 1997 and 2024 as landslides and identified a total of 478 fatalities in 212 deadly incidents. Of the landslides captured on the web, 66.5% were located at the neighborhood/village level, providing substantial spatial accuracy. This location accuracy has also enabled risk estimation at the neighborhood/village level. Comparison with the manual national inventory shows moderate agreement, with F1 scores ranging from 0.434 to 0.552 for the ±1-day and ±7-day time windows, respectively. The automated method not only captures spatial and temporal patterns of landslides but also extracts key attributes such as location, number of fatalities, and triggering factors (i.e., natural and anthropogenic). Our study demonstrates the potential of web-based automated approaches to complement traditional landslide inventories by providing near-real-time and verifiable data. Finally, we suggest adopting common reporting standards for natural hazard coverage in digital newspapers so that this approach can spread globally.

Keywords: Landslides, Landslide inventory, Web scraping, Natural language processing, Geocoding

1. Introduction

Landslides are among the most common natural hazards, resulting in substantial loss of life and economic damage worldwide (Fidan et al., 2024; Froude & Petley, 2018; Kirschbaum et al., 2015). Landslides triggered by rainfall (Emberson et al. 2022; Ozturk et al. 2022), earthquakes (Görüm et al., 2011, 2025; Tanyaş et al., 2017), and human activities (Guns and Vanacker 2014; Depicker et al. 2021; Ozturk et al.
2022) cause substantial loss of life and socio-economic damage (Froude and Petley 2018). As a result, recording landslide events contributes to the prevention of loss of life and property by providing an understanding of their spatial and temporal distribution, as well as identifying the factors that control their formation (Gómez et al., 2023; Kirschbaum et al., 2010; van Westen et al., 2006).

Landslide inventories are a fundamental data source for susceptibility, hazard, and risk analyses (Guzzetti et al. 2005, 2012; van Westen et al. 2006; Rossi et al. 2019; Caleca et al. 2025), as well as the development of early warning systems (Guzzetti et al. 2020; Fang et al. 2023). Inventories compiled to gain insight into landslides are typically classified as archive, historical, event-based, seasonal, or multi-temporal (Guzzetti et al. 2012). Among these, archive inventories with large spatial and temporal coverage are widely used, compiling information from heterogeneous sources such as newspapers, media archives, and technical or scientific reports (Guzzetti et al. 1994; Hervás 2013; Klose et al. 2016). They can record all known landslide events and cover periods of up to hundreds of years on various scales. For example, global efforts have been made to understand the spatial and temporal trends of rainfall-induced (Kirschbaum et al. 2009, 2012) and fatal landslides (Petley 2012; Froude and Petley 2018; Haque et al. 2019). Also, regional inventories have been compiled using multi-country approaches that combine national records in Europe (Van Den Eeckhaut and Hervás 2012; Haque et al. 2016) and in Latin America and the Caribbean (Sepúlveda and Petley 2015).

Although global and regional inventories have made useful contributions to the understanding of landslide events, they have limitations in comprehensively identifying landslides that exhibit complex spatial and temporal patterns (Kirschbaum et al., 2010; Petley, 2012).
Global datasets tend to represent only a fraction of landslide events, focusing on events reported in the international media or on fatal landslides, that is, on events with severe impacts and consequences (Spizzichino et al. 2010; Sepúlveda and Petley 2015). This underestimation results in spatial and temporal gaps, causing global inventories to lose their capacity to accurately represent landslides at the national level. For instance, the Global Fatal Landslide Database (GFLD) reported only 53 fatal landslides in Türkiye (Froude and Petley 2018). During the same period, the Fatal Landslides Database of Türkiye (FATALDOT) compiled 191 events (Görüm and Fidan 2021). Also, while the Global Landslide Catalog (GLC) recorded only 67 rainfall-induced landslides in Italy (Kirschbaum et al., 2015), the newly developed e-ITALICA catalog increased this number to 6,312 (Brunetti et al. 2025).

Systematically compiled national inventories provide more consistent and standardized records in terms of spatial and temporal coverage. Several countries, for example, Italy (Guzzetti 2000; Calvello and Pecoraro 2018; Brunetti et al. 2025), Colombia (Aristizábal and Sánchez 2020; Garcia-Delgado et al. 2022), the United States (Mirus et al. 2020), China (Lin and Wang 2018; Zhang et al. 2023), Germany (Damm & Klose, 2015), and Türkiye (Fidan and Görüm 2020; Görüm and Fidan 2021), have developed national inventories that provide more detailed and reliable landslide records. The use of local language (Sepúlveda and Petley 2015) and national sources enables the capture of events that are not usually found in global inventories. Nevertheless, compiling, integrating, and analyzing such inventories often requires a significant amount of time and effort, making their development both labor-intensive and operationally challenging.

Over the past few years, the use of digital technologies has led to a fundamental change in the methods of collecting, analyzing, and sharing natural hazard data.
Specifically, with the widespread use of the internet, a tremendous amount of information concerning natural hazards is created on online platforms such as news portals, government institution websites, social media, and digital archives (Lai et al. 2022; Avcıoğlu et al. 2025). In this context, web scraping, developed as a means of gathering information from the web, offers a faster, cheaper, and less labor-intensive alternative to conventional data collection methods (Cording 2011; Vargiu and Urru 2012; Chauhan et al. 2023). The rapid growth of digital content and advances in Natural Language Processing (NLP) (Young et al. 2018; Kang et al. 2020; Raffel et al. 2020; Koltsakis et al. 2023; Kumar and Renuka 2023) have created new opportunities for transitioning from analog methods to automation in collecting data related to natural hazards. Web-based methods, especially web scraping and crawling, combined with NLP, are increasingly being used to automatically develop natural hazard databases by extracting information from large volumes of text (Avcıoğlu et al., 2025; Battistini et al., 2013, 2017; Carley et al., 2016; Goswami et al., 2018; Lausch et al., 2015). These techniques enable the systematic collection of data on natural hazards by extracting pertinent information, such as location, date, number of deaths and casualties, and triggers, from online news reports, official reports, social media, and public archives (Battistini et al. 2013; Carley et al. 2016; Goswami et al. 2018). Consequently, these new methods enable the monitoring and recording of natural hazards in near real-time, either to supplement or replace more conventional observational methods.

In line with these developments, an increasing number of studies have adopted web-based techniques to automatically compile landslide inventories from digital media sources.
These approaches have proven effective in capturing events that were missed by traditional inventories and in enhancing the spatial and temporal completeness of landslide records. For example, after searching local newspapers, 111 previously unrecognized events have been added to the UK National Landslide Database, and information regarding the impacts of some 90% of the recognized landslides has been compiled (Taylor et al. 2015). Furthermore, large-scale data mining has demonstrated the ability to significantly expand the scope of events, even in municipalities not previously included in Italy's existing inventories (Franceschini et al. 2022). Even though web-based inventories play a supporting role for national databases, manual verification is still very important due to location uncertainties, inconsistent terms, and missing information (Battistini et al. 2017; Kreuzer and Damm 2020).

Although it has become possible to detect landslide events from digital news sources automatically, research in this area remains limited in terms of scope and accuracy. For instance, existing landslide inventories generally do not offer spatial resolution finer than the provincial level, and only a small fraction are verified through comparative analysis with manually compiled landslide datasets (Avcıoğlu et al., 2025; Franceschini et al., 2022). In Türkiye, where landslides are widespread and deadly, a web-based inventory that includes systematic validation is still not available.

To address this gap, this study created, validated, and spatially analyzed a national landslide inventory using web scraping methods (Fig. 1). Here, a fully automated framework was developed to detect, map, and analyze fatal and non-fatal landslides in Türkiye using online media news sources.
The approach captures landslide events by integrating web scraping, NLP, and spatial inference, and assigns them to administrative neighborhoods using geocoding routines when location information is available in the news. The web-based inventory was also validated against a manually compiled national landslide database (Görüm et al., 2025) and provides a risk estimation at the neighborhood level. By combining near-real-time data collection with spatial accuracy and validation, this study shows how automated inventories can go beyond incident detection to provide actionable risk information at the local level.

2. Materials and methods

We developed a web-scraping algorithm to automatically detect and analyze landslide events in Türkiye from digital media sources (Najatishendi 2025). For this purpose, first, URLs were collected from news websites by identifying relevant keywords. During the web scraping process, we examined the HTML structures and extracted key information, including titles, content, and publication dates. We then analyzed the resulting text using natural language processing (NLP) techniques to classify the location, date, number of deaths, and triggering factors. Landslide events were spatially mapped by geocoding place names using Nominatim and Geopy. All extracted information was compiled into a structured dataset and validated against a manually prepared inventory. Finally, we calculated and spatially analyzed a risk estimate that considers landslide probability, exposure, and fatalities (Fig. 2).

2.1. Data collection

In this study, we utilized web scraping techniques to identify landslide events in digital news sources. In this context, we analyzed the HTML structures of various news websites in Türkiye in detail and developed data extraction strategies suitable for the content presentation style of each source. HTML tags, classes, and ID structures in web pages were considered the main reference points for accurate content extraction.
Nevertheless, changes in the HTML structure of websites over time can negatively impact the accuracy of the extraction process. In particular, when a style class name is changed or a content block is moved to a different structure, non-updated systems may not recognize the relevant data, resulting in incomplete or inaccurate data collection. To minimize such problems, we structured the scrapers to be as flexible, traceable, and updatable as possible. While such incompatibilities are less common for sites using structured data standards (e.g., Schema.org), a systematic maintenance process became necessary for resources with non-standard HTML layouts.

In the process of collecting URLs, we used a Google search engine scraper to identify news articles related to landslides. The resulting list of links was processed by a second scraper we developed to extract news headlines, publication dates, keywords, and news bodies. At this stage, we used a list of Turkish keywords to pre-filter the content related to landslides. We also identified their English equivalents for comparison with international practices (Table 1).

Table 1. Keywords (Turkish) used to capture digital news sources on landslides in Türkiye. Keywords are given with their English equivalents.
| Keywords in Turkish | Keywords in English |
|---|---|
| çamur akıntısı | mud flow |
| çamur akması | mudflow (mudslide) |
| çamur hareketi | mud movement |
| heyelan | landslide |
| kaya çökmesi | rock collapse |
| kaya devrilmesi | rock overturning |
| kaya düşmesi | rock fall |
| kaya hareketi | rock movement |
| kaya kayması | rock sliding |
| kaya yuvarlanması | rock rolling |
| moloz akıntısı | debris flow |
| moloz akması | debris flow |
| moloz hareketi | debris movement |
| toprak çökmesi | soil collapse |
| toprak hareketi | soil movement |
| toprak kayması | soil sliding |
| toprak sürüklenmesi | soil dragging |
| yamaç kayması | slope sliding |
| yer çökmesi | ground subsidence |
| yer hareketleri | ground movements |
| yer kayması | ground sliding |
| zemin çökmesi | ground collapse |
| zemin hareketi | ground movement |
| zemin kayması | ground sliding |

2.2. Natural Language Processing (NLP)

NLP, a branch of artificial intelligence (AI), enables computers to understand, analyze, and produce human language (Jurafsky and Martin 2020). In disaster news analysis, NLP is a core technology that turns unstructured text into structured, analyzable input (Bird et al. 2009). We applied NLP techniques to the collected news articles to automatically extract details such as the event location (city, district, village, neighborhood), the number of casualties (dead, injured, missing), the kind of landslide trigger (e.g., heavy precipitation, geological instability), the date of the event, and, if several events are reported, the total number of landslides (Manning et al. 2014). Python is the programming language of choice in this area due to its simplicity and the extensive availability of NLP and web scraping libraries. BeautifulSoup, Scrapy, and Requests are among the most commonly used tools for extracting and manipulating web content.
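The keyword pre-filter from Section 2.1 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and plain substring matching with Turkish-aware lowercasing is an assumption. The lowercasing step matters because Python's default `str.lower()` maps the dotless capital "I" to "i" rather than "ı", so upper-case headlines such as "TOPRAK KAYMASI" would otherwise be missed.

```python
import unicodedata

# Turkish keywords from Table 1.
KEYWORDS_TR = (
    "çamur akıntısı", "çamur akması", "çamur hareketi", "heyelan",
    "kaya çökmesi", "kaya devrilmesi", "kaya düşmesi", "kaya hareketi",
    "kaya kayması", "kaya yuvarlanması", "moloz akıntısı", "moloz akması",
    "moloz hareketi", "toprak çökmesi", "toprak hareketi", "toprak kayması",
    "toprak sürüklenmesi", "yamaç kayması", "yer çökmesi", "yer hareketleri",
    "yer kayması", "zemin çökmesi", "zemin hareketi", "zemin kayması",
)


def tr_lower(text: str) -> str:
    """Lowercase using Turkish casing rules (I -> ı, İ -> i)."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("İ", "i").replace("I", "ı").lower()


def matched_keywords(text: str) -> list:
    """Return the Table 1 keywords that occur in the text, in table order."""
    lowered = tr_lower(text)
    # Substring matching also catches suffixed forms such as "heyelanlar".
    return [kw for kw in KEYWORDS_TR if kw in lowered]


def is_landslide_candidate(title: str, body: str = "") -> bool:
    """Hypothetical pre-filter: keep an article if any keyword appears."""
    return bool(matched_keywords(f"{title} {body}"))
```

A keyword hit only marks an article as a candidate; the later classification and consistency checks still decide whether it reports an actual landslide.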
As content is collected, it is passed through a series of NLP tasks, including text tokenization, named entity recognition (NER), event discovery, and geocoding (Bird et al. 2009; Manning et al. 2014).

2.3. Geocoding

We used an open-source geocoding approach to automatically obtain geographic coordinates (latitude and longitude) from Turkish address data (Chow et al. 2016; Kilic et al. 2023). For this purpose, we integrated the Nominatim service, which uses the OpenStreetMap (OSM) infrastructure, into the Python environment through the geopy library. During the geocoding process, we structured each address record to form four-level address combinations, considering neighborhood, village, district, and province components. The four levels, ranked from the most comprehensive to the simplest format, were applied in sequence: neighborhood-village-district-province, village-district-province, district-province, and province only. When the most detailed administrative unit could not be resolved, the larger administrative units referenced in the news were used for localization (e.g., Battistini et al. 2013; Froude and Petley 2018). Our multi-stage query approach enabled location determination even for records with missing or incomplete address information. For each successful match, we added the coordinate information (settlement center) to the relevant record, and in cases where no match was obtained, we left the field blank. Also, we supported our approach with retry and delayed processing mechanisms to avoid connection problems and timeout errors. We systematically applied this methodology to a tabular dataset. For each address record, we tested the four combinations in sequence and added latitude-longitude information as new columns to the dataset. After the process was complete, we exported the file containing all records enriched with coordinate information as a separate Excel spreadsheet.
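The four-level fallback with retries described above might look like the following sketch. It is written under stated assumptions rather than taken from the authors' code: the record field names, the appended country name, the retry count, and the delay are all hypothetical. The geocoder is passed in as a plain callable so the cascade can be exercised without network access; `make_nominatim_geocoder` shows one way to plug in geopy's `Nominatim` client.

```python
import time
from typing import Callable, Optional, Tuple

Coords = Tuple[float, float]
Geocoder = Callable[[str], Optional[Coords]]


def candidate_queries(record: dict, country: str = "Türkiye") -> list:
    """Address strings from most to least detailed; missing parts are skipped."""
    levels = [
        ("neighborhood", "village", "district", "province"),
        ("village", "district", "province"),
        ("district", "province"),
        ("province",),
    ]
    queries = []
    for fields in levels:
        parts = [record[f] for f in fields if record.get(f)]
        if not parts:
            continue
        query = ", ".join(parts + [country])
        if query not in queries:  # incomplete records can repeat a string
            queries.append(query)
    return queries


def geocode_with_fallback(record, geocode, retries=2, delay_s=1.0):
    """Return ((lat, lon), matched_query), or None when no level matches."""
    for query in candidate_queries(record):
        for _attempt in range(retries + 1):
            try:
                hit = geocode(query)
            except Exception:  # sketch only; narrow to the client's timeout errors
                time.sleep(delay_s)
                continue
            if hit is not None:
                return hit, query
            break  # clean "no match": fall back to the next, coarser level
    return None


def make_nominatim_geocoder(user_agent: str = "landslide-inventory-sketch") -> Geocoder:
    """Wrap geopy's Nominatim client; needs geopy installed and network access."""
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent, timeout=10)

    def geocode(query: str) -> Optional[Coords]:
        location = geolocator.geocode(query)
        return (location.latitude, location.longitude) if location else None

    return geocode
```

Returning the matched query string alongside the coordinates records which administrative level produced the hit, which is the information later needed to report location accuracy per level.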
This approach enabled the capture and processing of large spatial landslide datasets in a low-cost and reproducible manner.

2.4. Validation of the web-scraped inventory

To assess the reliability of the web-based landslide inventory, we performed a systematic validation using a previously manually compiled reference landslide inventory for Türkiye (Görüm et al., 2025). The validation was performed for 2010–2020, as older online news reports are often removed from the internet and thus become unavailable for web-scraping approaches. Furthermore, the manual inventory only extends up to 2020, making this period most suitable for a consistent and comprehensive comparison. Due to uncertainty in spatial accuracy, the validation focused only on temporal consistency. For each landslide event in the web-based inventory, a match was accepted if at least one event in the manual inventory occurred within a window of 7 days before and 7 days after the web-based event date (i.e., ±7 days) (Taylor et al. 2015; Battistini et al. 2017). The analysis was also run for ±1-, ±2-, ±3-, and ±5-day windows to test sensitivity to this parameter. A window wider than ±7 days was not used, to avoid matching unrelated events and artificially inflating the agreement between inventories (Battistini et al. 2017).

We performed the validation process in three steps. First, we compared each event collected from the web with the manual inventory. Here, if a manual event was recorded within ±n days of the date of the event collected from the web, the web event was counted as a true positive (TP). Next, we defined false positives (FP) as events collected from the web for which no manual record was found within the window. Finally, we classified false negatives (FN) as manual events for which no event collected from the web occurred within the window (Lai et al. 2022; Bhuyan et al. 2023).
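The three-step counting rule can be sketched as below. This is an illustration that follows the definitions in the text literally; the function name is hypothetical, and it assumes that matching is not forced to be one-to-one (one manual record may confirm several web events), a detail the text does not specify.

```python
from bisect import bisect_left
from datetime import date, timedelta


def match_counts(web_dates, manual_dates, tolerance_days):
    """Count TP, FP, FN for a symmetric +/- tolerance_days window."""
    tol = timedelta(days=tolerance_days)
    manual_sorted = sorted(manual_dates)
    web_sorted = sorted(web_dates)

    def has_neighbor(day, pool):
        # First pool date that is >= day - tol; it matches if it is <= day + tol.
        i = bisect_left(pool, day - tol)
        return i < len(pool) and pool[i] <= day + tol

    tp = sum(has_neighbor(d, manual_sorted) for d in web_dates)      # web event confirmed
    fp = len(web_dates) - tp                                         # web event unconfirmed
    fn = sum(not has_neighbor(d, web_sorted) for d in manual_dates)  # manual event missed
    return tp, fp, fn
```

For example, with web events on 10 January and 1 March 2015 and manual events on 12 January and 1 June 2015, a ±7-day window yields one TP, one FP, and one FN, while a ±1-day window yields no matches at all.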
To quantify the performance of the web-scraped inventory relative to the manual inventory, we measured three standard performance metrics: precision, recall, and F1-score. Precision is the ratio of web-scraped events that are also found in the manual inventory to all web-scraped events, reflecting the accuracy of the detected events. Recall is the ratio of events successfully identified by web scraping to all events in the manual inventory, reflecting the completeness of detection. The F1-score is the harmonic mean of precision and recall, a single metric that weighs accuracy and completeness evenly (Yacouby and Axman 2020; Lai et al. 2022; Bhuyan et al. 2023). The metrics were computed as follows:

$$Precision = \frac{TP}{TP + FP} \tag{1}$$

$$Recall = \frac{TP}{TP + FN} \tag{2}$$

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} \tag{3}$$

where TP, FP, and FN represent true positives, false positives, and false negatives, respectively.

2.5. Risk estimation

To estimate landslide risk at the neighborhood level, we developed a method that integrates landslide probability, population exposure, and recorded fatalities.
In the first step, we calculated the probability of at least one landslide occurring during the 27 years of the inventory:

$$P_L = \frac{Landslides_i}{\text{inventory period}} \tag{4}$$

where $Landslides_i$ is the total number of landslides recorded in neighborhood i over the past 27 years. Next, we computed population exposure (van Westen et al. 2006; Corominas et al. 2014; Maes et al. 2017) as the product of the probability of landslides and the area-normalized population (Lebakula et al. 2024) of each neighborhood:

$$Exposure_i = P_L \times \left(\frac{Population_i}{Area_i}\right) \tag{5}$$

where $Exposure_i$ represents the population exposure for neighborhood i, $P_L$ is the probability of at least one landslide occurring within 27 years, $Population_i$ is the total population of the neighborhood, and $Area_i$ is the area of the neighborhood in square kilometers. Area normalization accounts for spatial heterogeneity and prevents overestimation of exposure in large administrative units.

Finally, we estimated the risk of landslides (Varnes 1984; Corominas et al. 2014) at the neighborhood level using two different approaches, depending on whether fatalities were recorded in each neighborhood.
In neighborhoods where landslide-related fatalities ($Fatalities_i$) were recorded, we defined risk as the ratio of total fatalities to the estimated exposure:

$$Risk_i = \frac{Fatalities_i}{Exposure_i} \tag{6}$$

On the other hand, in neighborhoods where no deaths were recorded, we calculated a proxy risk estimate using relative landslide frequency and exposure. Here, $Landslides_i$ is the total number of landslides recorded in the neighborhood, whereas $Total_L$ is the total number of landslides recorded in all neighborhoods.

$$Risk_i = \left(\frac{Landslides_i}{Total_L}\right) \times Exposure_i \tag{7}$$

We also applied a median-based normalization to ensure comparability between fatal and non-fatal neighborhoods. We calculated the proxy risk values for all neighborhoods with no recorded fatalities, then calculated the median of these values ($Median_{Nonfatal}$). Then, we rescaled the risk value of each non-fatal neighborhood by multiplying it by the ratio of the median risk in fatal neighborhoods ($Median_{Fatal}$) to $Median_{Nonfatal}$:

$$Risk_{iN} = Risk_i \times \left(\frac{Median_{Fatal}}{Median_{Nonfatal}}\right) \tag{8}$$

This scaling aligns the two risk distributions and reduces potential bias in non-fatal areas (Fig. 9b).
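Eqs. (4) to (8) can be chained as in the following sketch. It is an illustration under stated assumptions, not the authors' code: the record field names and function names are hypothetical, and every neighborhood is assumed to have at least one recorded landslide, so that the exposure in Eq. (6) is never zero.

```python
from statistics import median

INVENTORY_YEARS = 27  # inventory period, 1997-2024


def exposure(landslides, population, area_km2, years=INVENTORY_YEARS):
    """Population exposure of one neighborhood."""
    p_l = landslides / years              # Eq. (4)
    return p_l * (population / area_km2)  # Eq. (5)


def estimate_risk(neighborhoods, years=INVENTORY_YEARS):
    """Return {name: risk}; non-fatal values are median-rescaled."""
    total_l = sum(n["landslides"] for n in neighborhoods)
    fatal, nonfatal = {}, {}
    for n in neighborhoods:
        exp_i = exposure(n["landslides"], n["population"], n["area_km2"], years)
        if n["fatalities"] > 0:
            fatal[n["name"]] = n["fatalities"] / exp_i                 # Eq. (6)
        else:
            nonfatal[n["name"]] = (n["landslides"] / total_l) * exp_i  # Eq. (7)
    if fatal and nonfatal:
        scale = median(fatal.values()) / median(nonfatal.values())     # Eq. (8)
        nonfatal = {name: risk * scale for name, risk in nonfatal.items()}
    return {**fatal, **nonfatal}
```

By construction, the median of the rescaled non-fatal values equals the median of the fatal values, which is what makes the two groups comparable on a single scale.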
Although landslide risk is calculated using two different formulations based on the presence or absence of fatalities, both outputs are referred to as $Risk_i$ for uniformity and comprehensibility in the manuscript. Also, we categorized risk areas as very high, high, moderate, low, and very low risk (van Westen et al. 2006).

3. Results

Using our web scraping algorithm, we captured 3051 news articles from across Türkiye between 1997 and 2024. We then separated the texts according to event date, location, and content attributes. Blank or incorrect province, district, and neighborhood names were corrected according to administrative standards. We applied a filter that included text classification and date-location consistency checks to remove duplicate news articles and those that did not report actual landslide events. Similar records captured within a ±3-day window at the same location (province, district, neighborhood) were considered duplicates and removed from the dataset. Ultimately, we labeled 1727 news articles as records reporting an actual landslide event. Our new database provides information on the ID, date, latitude, longitude, location (i.e., region, subregion, province, district, and neighborhood), number of deaths and injuries, and triggering factor (e.g., natural or anthropogenic) for each landslide event scraped from the web. We also classified 212 of these events, which together caused 478 deaths, as fatal landslides.

In determining the location of each landslide event, we matched the location names reported in news articles to the corresponding existing settlement centers. In this respect, we located 66.5% of the 1727 landslide events at the neighborhood/village level (n = 1149), 30.6% at the district level (n = 528), and 2.9% at the province level only (n = 50).
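The ±3-day duplicate rule described in this section could be sketched as follows. The field names and function name are hypothetical, and one detail is an assumption: each record is compared with the most recent record kept for the same location, so a long chain of reports spaced two days apart is not collapsed into a single event.

```python
from datetime import date


def deduplicate(records, window_days=3):
    """Keep the earliest record per location within each +/- window_days cluster."""
    kept = []
    last_kept = {}  # (province, district, neighborhood) -> date of last kept record
    for rec in sorted(records, key=lambda r: r["date"]):
        key = (rec["province"], rec["district"], rec["neighborhood"])
        previous = last_kept.get(key)
        if previous is not None and (rec["date"] - previous).days <= window_days:
            continue  # same place within the window: treat as a duplicate report
        kept.append(rec)
        last_kept[key] = rec["date"]
    return kept
```

Because the records are processed in date order, the difference to the last kept record is never negative, so checking the upper bound is enough to apply the symmetric window.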
For the 66.5% of landslides located at the neighborhood/village level, the location uncertainty corresponds to an administrative unit with a mean planimetric area of 16 km² (standard deviation 26 km²). For the remaining 33.5%, the location uncertainty reaches a mean planimetric area of up to 1000 km².

3.1. Quantitative assessment of validation

We tested the accuracy of the web-scraped landslide inventory by comparing it with a manually compiled inventory. We assessed temporal agreement between event dates at tolerance levels of ±1, ±2, ±3, ±5, and ±7 days. Analysis for the years 2010–2020 revealed that the choice of temporal tolerance had a significant influence on agreement measures between the two inventories. Table 2 and Fig. 3 provide the detailed results for the five temporal tolerance levels.

Table 2. Performance metrics for different temporal tolerance windows (2010–2020)

| Tolerance (days) | TP | FP | FN | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|
| ±1 | 636 | 202 | 1455 | 0.759 | 0.304 | 0.434 |
| ±2 | 712 | 126 | 1379 | 0.850 | 0.341 | 0.486 |
| ±3 | 757 | 81 | 1334 | 0.903 | 0.362 | 0.517 |
| ±5 | 796 | 42 | 1295 | 0.950 | 0.381 | 0.544 |
| ±7 | 808 | 30 | 1283 | 0.964 | 0.386 | 0.552 |

Note: True Positives (TP), False Positives (FP), and False Negatives (FN).

Using a ±1-day window, 636 web-scraped events were matched to manual records (TP), with 202 events unmatched (FP), and 1455 manual events not detected by web scraping (FN). The corresponding precision, recall, and F1-score were 0.759, 0.304, and 0.434, respectively. As the window widened to ±5 and ±7 days, these metrics improved notably, with the ±7-day window providing the highest performance levels (precision = 0.964, recall = 0.386, F1 = 0.552; see Table 2 and Fig. 3). Since the F1-score is the harmonic mean of precision and recall, low recall caps the F1-score. At the ±7-day window, precision is already close to 0.96, so the only way to increase the F1-score further is to improve recall significantly.
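As an arithmetic check on Table 2 and on the ceiling that recall places on the F1-score, the following sketch recomputes Eqs. (1) to (3) from the reported TP, FP, and FN counts. The function and variable names are illustrative only.

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1-score from the counts, following Eqs. (1)-(3)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (TP, FP, FN) per tolerance window in days, copied from Table 2.
TABLE_2 = {
    1: (636, 202, 1455),
    2: (712, 126, 1379),
    3: (757, 81, 1334),
    5: (796, 42, 1295),
    7: (808, 30, 1283),
}

for tolerance, counts in TABLE_2.items():
    p, r, f1 = prf1(*counts)
    print(f"±{tolerance} d: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")

# With perfect precision (P = 1), Eq. (3) reduces to 2R / (1 + R).
recall_7d = prf1(*TABLE_2[7])[1]
f1_ceiling = 2 * recall_7d / (1 + recall_7d)
```

Rounded to three decimals, the recomputed values agree with Table 2. The ceiling for the ±7-day recall is about 0.56, only slightly above the reported 0.552, which is why further gains must come from recall rather than precision.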
However, as tolerance widens, additional matching gains exhibit diminishing returns. Hence, the increase in the F1-score remains limited (from 0.486 at ±2 days to 0.552 at ±7 days).

3.2. Temporal distribution

Over the 27-year period from 1997 to 2024, the annual number of landslides remained low until 2007 and first began to rise in 2008. The annual number of landslides exceeded 50 in 2014 and 100 in 2018. By 2024, the number of landslides reached its highest value, 276 (Fig. 4a). Despite annual variations, the number of landslides generally shows an increasing trend. Overall, we documented an average of 64 landslides per year in Türkiye between 1997 and 2024.

The temporal distribution of landslide fatalities is not uniform (Fig. 4b). Until 2008, the annual number of deaths was mostly less than 10. Although the number of deaths began to increase in 2008, it remained low between 2009 and 2015. Between 2016 and 2024, more deaths were reported than in previous years. In particular, two sharp increases were recorded in 2016 (n = 77) and 2023 (n = 63). These peaks and irregular fluctuations in the annual death toll are related to individual events causing multiple deaths. For example, single landslide events killed 44 people in 2016 and 15 people in 2023. Over the 27-year period, landslides in Türkiye caused 478 deaths, an average of 18 people per year.

Figure 5 shows the spatial distribution of landslide events on a monthly basis. Between 1997 and 2024, landslide events were concentrated in the winter season (31.2%, n = 538). The number of events, which was relatively low at 127 in December, increased to 199 in January and then to 212 in February, reaching the highest frequency (Fig. 5a and b). However, the winter season is also the period with the lowest share of deaths (14%, n = 67) (Fig. 5c and d).
In spring and summer, which account for 25.8% (n = 445) and 24.6% (n = 424) of the total number of incidents, respectively (Fig. 5a and b), approximately 56% of total deaths were recorded (spring: n = 125, 26.2%; summer: n = 142, 29.7%) (Fig. 5c and d). Although autumn accounted for only 18.5% of total landslide events (n = 320), it was the season with the highest share of deaths (30.1%; n = 144).

3.3. Spatial distribution and triggering factors

Our web scraping algorithm assigns the administrative unit mentioned in the news content to the settlement center when determining the location of landslide events. The location therefore carries a margin of error equal to the area of the most detailed administrative unit mentioned in the news content (on average at least 16 km²), so landslides are also grouped by region and by province. Here, we analyze the spatial distribution of landslides (Fig. 6) and fatalities (Fig. 7) at the regional and provincial levels.

Regionally, the Black Sea region has the highest concentration of landslides (38.2%, n = 659), while the Southeastern Anatolia region has the lowest number (5.8%, n = 101). In other regions, landslides occurred at a rate of 18.6% (n = 321) in Marmara, 12.3% (n = 212) in Eastern Anatolia, 10% (n = 172) in the Mediterranean, 9% (n = 155) in the Aegean, and 6.2% (n = 107) in Central Anatolia (Fig. 6a and b). Türkiye has 81 provinces, and at least one landslide event has been recorded in all of them. Istanbul is the province with the highest number of landslide events (n = 180) and accounts for 10% of the records. Following Istanbul, the provinces of Rize (n = 97), Trabzon (n = 94), Artvin (n = 91), Ordu (n = 67), and Zonguldak (n = 554) stand out as the provinces with the highest landslide frequency (Fig. 6c and d).

At the regional level, landslide fatalities are also most frequently observed in the Black Sea region.
A total of 210 deaths have been recorded in the Black Sea region, accounting for 44% of Türkiye's landslide fatalities (Fig. 7a and b). This is followed by the Marmara (12.3%, n = 59), Mediterranean (11.7%, n = 56), Southeastern Anatolia (10.7%, n = 51), Central Anatolia (9.2%, n = 44), and Eastern Anatolia (6.9%, n = 33) regions. The lowest number of deaths was recorded in the Aegean region (5.2%, n = 25).

At the provincial level, at least one death has been recorded in 59 of the 81 provinces (Fig. 7c and d). Trabzon has the highest number of fatalities (n = 77, 16.1%). Although Istanbul ranks first in the number of landslides (n = 180), its number of fatalities is relatively low (n = 29), placing it second. These are followed by Rize (n = 28), Kastamonu (n = 28), Adana (n = 24), Adıyaman (n = 21), Artvin (n = 16), and Sivas (n = 15). Together, these provinces, where landslide deaths are most concentrated, account for approximately 50% of total deaths.

Our web scraping algorithm can also identify the triggering factors of the landslide events it captures (Fig. 8). We grouped the triggering factors into three categories: (i) natural (e.g., rainfall, snowmelt), (ii) anthropogenic (e.g., construction, infrastructure, mining), and (iii) N/A (not available). Among the classified landslide events, natural triggers formed the largest group with 1173 events, followed by anthropogenic triggers with 350 events. The remaining 204 events were assigned to the N/A category due to the absence of trigger information.

3.4. Distribution of risk estimates

To examine the distribution of landslides across the country in further detail, we estimated and compared landslide risk in neighborhoods nationwide by combining landslide events and deaths into a single index.
Risk estimates are mainly concentrated in the low (n = 292) and moderate (n = 230) categories, while the high (n = 171) and very high (n = 122) categories include relatively fewer neighborhoods (Fig. 9d). Istanbul has the largest number of very high-risk neighborhoods (n = 66), followed by Rize, Zonguldak, Ankara, and Trabzon (Fig. 9a and Table S1). Istanbul also contains the largest number of high-risk neighborhoods (n = 21), bringing its total in the top two risk categories to 87, more than any other province in the country (Table S1).

Beyond their spatial distribution, we analyzed whether the risk values exhibited consistent patterns across fatal and non-fatal neighborhoods. After median rescaling, the standardized risk values were comparably distributed for fatal and non-fatal neighborhoods (Fig. 9b). Although derived from different formulas, both groups had a similar distribution and central tendency in their log values. After scaling and transformation, the standardized risk values approximated a normal distribution with slight skewness (Fig. 9c).

4. Discussion

In this study, we present a landslide inventory obtained automatically from web sources using web scraping and natural language processing (NLP) techniques. The web scraping algorithm identified 3051 news articles published between 1997 and 2024. Processing these articles confirmed 1727 landslide events and identified 478 deaths caused by 212 landslides. High spatial accuracy was achieved by locating 66.5% of landslide events at the neighborhood/village level. Furthermore, comparison with the manually compiled national inventory showed moderate agreement (F1 = 0.552) within a ± 7-day tolerance window. Our findings show that web scraping of free and publicly available online sources is an effective way to build archive inventories and provides a reliable data source that complements traditional inventories.
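The comparison with the manual inventory within a temporal tolerance window can be illustrated with a minimal sketch. It assumes that each record is reduced to a (province, date) pair and that records are paired greedily, one to one; the matching rule actually used in the study is not specified here, and the function names and toy records below are hypothetical.

```python
from datetime import date


def count_matches(auto_events, manual_events, tolerance_days):
    """Greedily pair each automated record with one unused manual record
    from the same province whose date differs by at most tolerance_days."""
    unused = list(manual_events)
    matches = 0
    for province, day in auto_events:
        for candidate in unused:
            same_place = candidate[0] == province
            close_in_time = abs((candidate[1] - day).days) <= tolerance_days
            if same_place and close_in_time:
                unused.remove(candidate)  # each manual record is matched once
                matches += 1
                break
    return matches


def precision_recall_f1(matches, n_auto, n_manual):
    """Precision = matches / automated records, recall = matches / manual
    records, F1 = harmonic mean of the two."""
    precision = matches / n_auto if n_auto else 0.0
    recall = matches / n_manual if n_manual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy records (hypothetical, not taken from either inventory).
auto = [
    ("Rize", date(2016, 8, 1)),
    ("Trabzon", date(2016, 8, 5)),
    ("Ordu", date(2016, 9, 1)),
]
manual = [
    ("Rize", date(2016, 8, 2)),
    ("Trabzon", date(2016, 8, 10)),
    ("Artvin", date(2016, 8, 3)),
    ("Ordu", date(2016, 9, 1)),
]

for window in (1, 7):
    m = count_matches(auto, manual, window)
    p, r, f1 = precision_recall_f1(m, len(auto), len(manual))
    print(f"+/-{window} d: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy case, widening the window from ± 1 to ± 7 days adds one match, which raises both precision and recall. Recall stays below 1 because one manual record has no counterpart in the automated list, which mirrors the coverage limitation discussed in this section.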
Previous studies have shown that landslide inventories created using web-based approaches locate events at the district and neighborhood/village level (Battistini et al. 2013; Franceschini et al. 2022; Avcıoğlu et al. 2025). However, we identified 66.5% of the landslide events in administrative units with an average planimetric area of 16 km² and assigned them directly to the settlement centers of these units. The main reason for this positioning is that landslides are generally reported in the news when they affect human life, settlements, or infrastructure (Moeller 2006; Allan et al. 2013). Therefore, we consider that locating landslides at the settled area rather than at the center of the administrative unit (the center of the polygon) increases location accuracy. This approach has enabled risk estimation by providing high spatial accuracy at the neighborhood/village level (Fig. 9).

Our web-scraping algorithm also provides more detailed information about landslide reports. For example, by extracting fatality information, fatal and non-fatal landslides were distinguished (Figs. 6 and 7), and the triggering factors (natural or anthropogenic) of landslides were identified (Fig. 8). Natural language processing (NLP) techniques made it possible to identify fatal and non-fatal incidents by scanning news content for expressions such as “death, injury, loss of life” and for numerical information. Similarly, using keywords and contexts such as “rain, snowmelt, construction, road work, mining,” the triggering factors of landslides were classified as natural or anthropogenic. As a result, our analysis shows that the web-scraping method not only captures data but also contributes to landslide hazard and risk studies by enriching data attributes. However, since the data come from news articles, which events are reported and what details these reports include largely depend on media practices.
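The keyword-based attribute extraction described above can be sketched as follows. This is an illustration only: the keyword lists are hypothetical English stand-ins (the study processes Turkish news), the fatality pattern is a simple regular expression rather than the authors' NLP pipeline, and checking natural triggers before anthropogenic ones is an assumption.

```python
import re

# Hypothetical English keyword lists; the study itself processes Turkish news.
TRIGGER_KEYWORDS = {
    "natural": ("rain", "snowmelt"),
    "anthropogenic": ("construction", "road work", "mining"),
}

# Matches phrases such as "3 people died" or "2 workers killed".
FATALITY_PATTERN = re.compile(
    r"\b(\d+)\s+(?:people\s+|persons?\s+|workers?\s+)?(?:died|dead|killed)\b",
    re.IGNORECASE,
)


def classify_trigger(text):
    """Return 'natural', 'anthropogenic', or 'N/A' from keyword hits.
    Categories are checked in dictionary order (an assumed priority)."""
    lowered = text.lower()
    for label, keywords in TRIGGER_KEYWORDS.items():
        # A leading word boundary lets "rain" match "rainfall" but not "train".
        if any(re.search(r"\b" + re.escape(k), lowered) for k in keywords):
            return label
    return "N/A"


def extract_fatalities(text):
    """Sum all death counts that match the fatality pattern (0 if none)."""
    return sum(int(n) for n in FATALITY_PATTERN.findall(text))


report = "Heavy rainfall triggered a landslide in Rize; 3 people died."
print(classify_trigger(report), extract_fatalities(report))
```

A rule of this kind is easy to audit but inherits the ambiguity of news language, so any real implementation would also need context checks to reject metaphorical uses of hazard terms.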
Data collected from the web is valuable for research into natural hazards such as landslides, but it also carries inherent biases (Taylor et al. 2015; Kreuzer and Damm 2020). In particular, the language used in natural hazard news can be ambiguous and have multiple meanings. For example, terms like “collapse” or “slide” may not refer to an actual hazard but be used metaphorically (e.g., “the collapse of the economy,” “the team's slide”). Such metaphorical or incorrect uses can cause problems in NLP and text mining processes, leading models to mistakenly interpret these expressions as disaster events (Bird et al. 2009; Jurafsky and Martin 2020).

Our results show that the web scraping method is accurate in terms of the events it detects, but its coverage is limited. High precision values (± 1 day = 0.76, ± 3 days = 0.90, and ± 7 days = 0.96) support the automatic approach's ability to accurately capture landslides reported in the media (Fig. 3 and Table 2). In contrast, the low recall, that is, the ratio of events identified by web scraping to events in the manual inventory (± 1 day = 0.30, ± 3 days = 0.36, and ± 7 days = 0.39), reflects the limited scope of news sources (Taylor et al. 2015) rather than the performance of the method. For example, the style in which details are reported and the reliability of the information vary between newspaper articles (Lai et al. 2022).

Comparisons with different inventories highlight the potential and limitations of the automated method more clearly. The 1727 landslides we captured between 1997 and 2024 are very close to the 1843 landslides reported in the inventory of Avcıoğlu et al. (2025), which covers almost the same period (1997–2023). This similarity indicates that web scraping has high potential for capturing landslide events.
Similarly, the 212 fatal landslides identified by our algorithm for the period 1997–2024 equal the number of events reported by Görüm and Fidan (2021) in their Fatal Landslide Database of Türkiye (FATALDOT) inventory for 1997–2019. The manually compiled FATALDOT inventory, which also includes historical records, reports a total of 389 fatal landslides over a longer period (1929–2019). Nevertheless, while 2091 landslides were recorded in the inventory used for validation covering the same period by Görüm et al. (2025), only 838 of these events could be detected using the web scraping method. This difference shows that the automated method does not record all events completely, and it also offers a way to quantify error rates. In particular, the inability to include events that are not covered by small-scale or local media (Carrara et al. 2003; Guzzetti and Tonelli 2004) in the inventory shows that the database created by web scraping is a complementary tool rather than a substitute for the manual inventory. Therefore, while web scraping is a powerful tool for systematically capturing current and recent data (e.g., Innocenzi et al. 2017; Kreuzer and Damm 2020; Franceschini et al. 2022), manual inventories perform a complementary function in terms of historical landslide events (e.g., Guzzetti 2000; Guzzetti et al. 2005).

Overall, our study shows that web scraping and natural language processing (NLP) techniques offer great potential for natural hazard inventories. However, realizing this potential and producing a nearly complete inventory does not depend on technical and algorithmic improvements alone. News content related to natural hazards such as landslides, floods, avalanches, fires, and sinkholes must also be presented in a more consistent and structured format according to international standards.
For example, reporting basic elements such as the time, location, magnitude, effects, and casualties of an event in a standardized manner will enable these techniques to work more comprehensively and accurately. In this context, cooperation between media organizations, disaster management agencies, and the scientific community, supported by international organizations such as the United Nations Educational, Scientific and Cultural Organization (UNESCO), the United Nations Office for Disaster Risk Reduction (UNDRR), and the World Meteorological Organization (WMO), is critical for establishing common standards for disaster reporting. Such an initiative will enable faster and more accurate recording of not only landslides but all natural hazards on a global scale, thereby making substantial contributions to scientific research as well as risk management and early warning systems.

5. Conclusion

This study demonstrates the potential of automated approaches based on digital media news in producing landslide archive inventories. Our approach, which combines web scraping, natural language processing (NLP), and geocoding techniques, verified 1727 landslides from 3051 news articles covering the period 1997–2024 and determined that 212 of these were fatal, resulting in a total of 478 deaths. The approach not only captures landslide events but also automatically distinguishes fatal from non-fatal cases and extracts the number of casualties, triggering factors, and other key attributes. Our results also show substantial spatial accuracy. Locating 66.5% of landslide events at the neighborhood/village level provides a resolution detailed enough for risk estimation. Comparisons with the manually compiled national inventory yield F1 scores ranging from 0.434 to 0.552 within time windows of ± 1 to ± 7 days, indicating moderate agreement.
These findings suggest that web-based automated approaches can complement and extend traditional inventories. However, the limitations of the method need to be considered. As the location information in news articles primarily refers to settlement names, the exact locations of landslides are underrepresented. The linguistic diversity of news reports, differences in terminology, and the lack of media coverage of events can also create gaps. Therefore, it is important to integrate web-based inventories with supporting information sources and to improve text mining algorithms in order to provide a more comprehensive representation. Efforts to improve multilingual data processing capabilities and reduce reporting biases are also needed. In conclusion, this study shows that the integration of web scraping, natural language processing (NLP), and geocoding techniques can be an alternative to traditional landslide archive inventories, offering low-cost, scalable, and near-real-time updates, especially at the national scale. Future research, algorithm improvements, and initiatives to standardize the reporting of natural hazard news will further develop this potential. The approach is therefore not only applicable to Türkiye but can also be adapted to different languages and applied in other countries or at a global scale. In this context, substantial contributions can be made to landslide risk management, early warning systems, and scientific research.

Declarations

Acknowledgments

This study is derived from the first author’s doctoral thesis conducted at Yildiz Technical University/Türkiye. T.G. acknowledges support from the Scientific and Technological Research Council of Türkiye (TUBITAK) under 2247-A National Outstanding Researchers Program grant number 123C512. The authors thank Dr. Ugur Ozturk for his support during the risk analysis process.

Competing interests

The authors declare no competing interests.
Contributions

EN is the corresponding author and contributed to the methodology, data collection, investigation, and formal analysis. EN wrote the first draft of the paper. SF and TG contributed to the visualization, validation, risk analysis, and writing of the manuscript. TG and FB contributed to the writing, supervision, and review of the manuscript. All authors contributed to the interpretation of the results, editing, and revision of the manuscript.

Code and data availability

The data and code for this research can be accessed at https://github.com/Elnaz66/webscrap (Najatishendi 2025).

References

Allan S, Adam B, Carter C (2013) Introduction: The media politics of environmental risk. In: Environmental risks and the media. Routledge, pp 1–26 Aristizábal E, Sánchez O (2020) Spatial and temporal patterns and the socioeconomic impacts of landslides in the tropical and mountainous Colombian Andes. Disasters 44:596–618. https://doi.org/10.1111/disa.12391 Avcıoğlu A, Demir O, Görüm T (2025) An automated approach for developing geohazard inventories using news: Integrating NLP, machine learning, and mapping. 2015:1–21 Battistini A, Rosi A, Segoni S, et al (2017) Validation of landslide hazard models using a semantic engine on online news. Appl Geogr 82:59–65. https://doi.org/10.1016/j.apgeog.2017.03.003 Battistini A, Segoni S, Manzo G, et al (2013) Web data mining for automatic inventory of geohazards at national scale. Appl Geogr 43:147–158. https://doi.org/10.1016/j.apgeog.2013.06.012 Bhuyan K, Tanyaş H, Nava L, et al (2023) Generating multi-temporal landslide inventories through a general deep transfer learning strategy using HR EO data. Sci Rep 13:1–26. https://doi.org/10.1038/s41598-022-27352-y Bird S, Klein E, Loper E (2009) Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc. Brunetti MT, Gariano SL, Melillo M, et al (2025) An enhanced rainfall-induced landslide catalogue in Italy. Sci data 12:216.
https://doi.org/10.1038/s41597-025-04551-6 Caleca F, Lombardo L, Steger S, et al (2025) Pan-European Landslide Risk Assessment: From Theory to Practice. Rev Geophys 63:1–45. https://doi.org/10.1029/2023RG000825 Calvello M, Pecoraro G (2018) FraneItalia: a catalog of recent Italian landslides. Geoenvironmental Disasters 5:. https://doi.org/10.1186/s40677-018-0105-5 Carley KM, Malik M, Landwehr PM, et al (2016) Crowd sourcing disaster management: The complex nature of Twitter usage in Padang Indonesia. Saf Sci 90:48–61. https://doi.org/10.1016/j.ssci.2016.04.002 Carrara A, Crosta G, Frattini P (2003) Geomorphological and historical data in assessing landslide hazard. Earth Surf Process Landforms 28:1125–1142. https://doi.org/10.1002/esp.545 Chauhan R, Negi A, Manchanda M (2023) An Extensive Review on Web Scraping Technique using Python. Proc 2023 2nd Int Conf Augment Intell Sustain Syst ICAISS 2023 1134–1138. https://doi.org/10.1109/ICAISS58487.2023.10250745 Chow TE, Dede-Bamfo N, Dahal KR (2016) Geographic disparity of positional errors and matching rate of residential addresses among geocoding solutions. Ann GIS 22:29–42. https://doi.org/10.1080/19475683.2015.1085437 Cording PH (2011) Algorithms for Web Scraping. 104 Corominas J, van Westen C, Frattini P, et al (2014) Recommendations for the quantitative analysis of landslide risk. Bull Eng Geol Environ 73:209–263. https://doi.org/10.1007/s10064-013-0538-8 Damm B, Klose M (2015) The landslide database for Germany: Closing the gap at national level. Geomorphology 249:82–93. https://doi.org/10.1016/j.geomorph.2015.03.021 Depicker A, Jacobs L, Mboga N, et al (2021) Historical dynamics of landslide risk from population and forest-cover changes in the Kivu Rift. Nat Sustain. https://doi.org/10.1038/s41893-021-00757-9 Emberson R, Kirschbaum D, Amatya P, et al (2022) Insights from the topographic characteristics of a large global catalog of rainfall-induced landslide event inventories. 
Nat Hazards Earth Syst Sci Discuss 1–33 Fang Z, Tanyas H, Gorum T, et al (2023) Speech-recognition in landslide predictive modelling: A case for a next generation early warning system. Environ Model Softw 170:105833. https://doi.org/10.1016/j.envsoft.2023.105833 Fidan S, Görüm T (2020) Türkiye’de Ölümcül Heyelanların Dağılım Karakteristikleri ve Ulusal Ölçekte Öncelikli Alanların Belirlenmesi. Türk Coğrafya Derg 74:123–134. https://doi.org/10.17211/tcd.731596 Fidan S, Tanyaş H, Akbaş A, et al (2024) Understanding fatal landslides at global scales: a summary of topographic, climatic, and anthropogenic perspectives. Nat Hazards 120:6437–6455. https://doi.org/10.1007/s11069-024-06487-3 Franceschini R, Rosi A, Catani F, Casagli N (2022) Exploring a landslide inventory created by automated web data mining: the case of Italy. Landslides 19:841–853. https://doi.org/10.1007/s10346-021-01799-y Froude MJ, Petley DN (2018) Global fatal landslide occurrence from 2004 to 2016. Nat Hazards Earth Syst Sci 18:2161–2181. https://doi.org/10.5194/nhess-18-2161-2018 Garcia-Delgado H, Petley DN, Bermúdez MA, Sepúlveda SA (2022) Fatal landslides in Colombia (from historical times to 2020) and their socio-economic impacts. Landslides 19:1689–1716. https://doi.org/10.1007/s10346-022-01870-2 Gómez D, García EF, Aristizábal E (2023) Spatial and temporal landslide distributions using global and open landslide databases. Springer Netherlands Görüm T, Bozkurt D, Korup O, et al (2025) The 2023 Türkiye-Syria earthquake disaster was exacerbated by an atmospheric river. Commun Earth Environ 6:1–10. https://doi.org/10.1038/s43247-025-02111-9 Gorum T, Fan X, van Westen CJ, et al (2011) Distribution pattern of earthquake-induced landslides triggered by the 12 May 2008 Wenchuan earthquake. Geomorphology 133:152–167. https://doi.org/10.1016/j.geomorph.2010.12.030 Görüm T, Fidan S (2021) Spatiotemporal variations of fatal landslides in Turkey. 1691–1705. 
https://doi.org/10.1007/s10346-020-01580-7 Goswami S, Chakraborty S, Ghosh S, et al (2018) A review on application of data mining techniques to combat natural disasters. Ain Shams Eng J 9:365–378. https://doi.org/10.1016/j.asej.2016.01.012 Guns M, Vanacker V (2014) Shifts in landslide frequency-area distribution after forest conversion in the tropical Andes. Anthropocene 6:75–85. https://doi.org/10.1016/j.ancene.2014.08.001 Guzzetti F (2000) Landslide fatalities and the evaluation of landslide risk in Italy. Eng Geol 58:89–107. https://doi.org/10.1016/S0013-7952(00)00047-8 Guzzetti F, Cardinali M, Reichenbach P (1994) The AVI project: A bibliographical and archive inventory of landslides and floods in Italy. Environ Manage 18:623–633. https://doi.org/10.1007/BF02400865 Guzzetti F, Gariano SL, Peruccacci S, et al (2020) Geographical landslide early warning systems. Earth-Science Rev 200:102973. https://doi.org/10.1016/j.earscirev.2019.102973 Guzzetti F, Mondini AC, Cardinali M, et al (2012) Landslide inventory maps: New tools for an old problem. Earth-Science Rev 112:42–66. https://doi.org/10.1016/j.earscirev.2012.02.001 Guzzetti F, Stark CP, Salvati P (2005) Evaluation of flood and landslide risk to the population of Italy. Environ Manage 36:15–36. https://doi.org/10.1007/s00267-003-0257-1 Guzzetti F, Tonelli G (2004) Information system on hydrological and geomorphological catastrophes in Italy (SICI): A tool for managing landslide and flood hazards. Nat Hazards Earth Syst Sci 4:213–232. https://doi.org/10.5194/nhess-4-213-2004 Haque U, Blum P, da Silva PF, et al (2016) Fatal landslides in Europe. Landslides 13:1545–1554. https://doi.org/10.1007/s10346-016-0689-3 Haque U, da Silva PF, Devoli G, et al (2019) The human cost of global warming: Deadly landslides and their triggers (1995–2014). Sci Total Environ 682:673–684. https://doi.org/10.1016/j.scitotenv.2019.03.415 Hervás J (2013) Landslide Inventory. In: Bobrowsky PT (ed) Encyclopedia of Natural Hazards. 
Springer Netherlands, Dordrecht, pp 610–611 Innocenzi E, Greggio L, Frattini P, de Amicis M (2017) A Web-Based Inventory of Landslides Occurred in Italy in the Period 2012–2015. In: Mikos M, Tiwari B, Yin Y, Sassa K (eds) Advancing Culture of Living with Landslides. Springer International Publishing, Cham, pp 1127–1133 Jurafsky D, Martin JH (2020) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. Third Edition draft. Summary of Contents, vii–x Kang Y, Cai Z, Tan CW, et al (2020) Natural language processing (NLP) in management research: A literature review. J Manag Anal 7:139–172. https://doi.org/10.1080/23270012.2020.1756939 Kilic B, Hacar M, Gülgen F (2023) Effects of reverse geocoding on OpenStreetMap tag quality assessment. Trans GIS 27:1599–1613. https://doi.org/10.1111/tgis.13089 Kirschbaum D, Adler R, Adler D, et al (2012) Global Distribution of Extreme Precipitation and High-Impact Landslides in 2010 Relative to Previous Years. J Hydrometeorol 13:1536–1551. https://doi.org/10.1175/JHM-D-12-02.1 Kirschbaum D, Stanley T, Zhou Y (2015) Spatial and temporal analysis of a global landslide catalog. Geomorphology 249:4–15. https://doi.org/10.1016/j.geomorph.2015.03.016 Kirschbaum DB, Adler R, Hong Y, et al (2010) A global landslide catalog for hazard applications: method, results, and limitations. Nat Hazards 52:561–575. https://doi.org/10.1007/s11069-009-9401-4 Kirschbaum DB, Adler R, Hong Y, Lerner-Lam A (2009) Evaluation of a preliminary satellite-based landslide hazard algorithm using global landslide inventories. Nat Hazards Earth Syst Sci 9:673–686. https://doi.org/10.5194/nhess-9-673-2009 Klose M, Maurischat P, Damm B (2016) Landslide impacts in Germany: A historical and socioeconomic perspective. Landslides 13:183–199.
https://doi.org/10.1007/s10346-015-0643-9 Koltsakis E, Klontzas ME, Karantanas AH (2023) What Is Artificial Intelligence: History and Basic Definitions Kreuzer TM, Damm B (2020) Automated digital data acquisition for landslide inventories. Landslides 17:2205–2215. https://doi.org/10.1007/s10346-020-01431-5 Kumar LA, Renuka DK (2023) State-of-the-Art Natural Language Processing. Deep Learn Approach Nat Lang Process Speech, Comput Vis 49–75. https://doi.org/10.1201/9781003348689-3 Lai K, Porter JR, Amodeo M, et al (2022) A Natural Language Processing Approach to Understanding Context in the Extraction and GeoCoding of Historical Floods, Storms, and Adaptation Measures. Inf Process Manag 59:102735. https://doi.org/10.1016/j.ipm.2021.102735 Lausch A, Schmidt A, Tischendorf L (2015) Data mining and linked open data - New perspectives for data analysis in environmental research. Ecol Modell 295:5–17. https://doi.org/10.1016/j.ecolmodel.2014.09.018 Lebakula V, Epting J, Moehl J, et al (2024) LandScan Silver Edition Lin Q, Wang Y (2018) Spatial and temporal analysis of a fatal landslide inventory in China from 1950 to 2016. Landslides 15:2357–2372. https://doi.org/10.1007/s10346-018-1037-6 Maes J, Kervyn M, de Hontheim A, et al (2017) Landslide risk reduction measures: A review of practices and challenges for the tropics. Prog Phys Geogr 41:191–221. https://doi.org/10.1177/0309133316689344 Manning CD, Bauer J, Finkel J, Bethard SJ (2014) The Stanford CoreNLP Natural Language Processing Toolkit. AclwebOrg 55–60 Mirus BB, Jones ES, Baum RL, et al (2020) Landslides across the USA: occurrence, susceptibility, and data limitations. Landslides 17:2271–2285. https://doi.org/10.1007/s10346-020-01424-4 Moeller SD (2006) ‘Regarding the Pain of Others’: Media, Bias and the Coverage of International Disasters. J Int Aff 59:173–XVI Najatishendi E (2025) Automated extraction of landslide events from Turkish news articles (Version 0.1.0) [Software]. 
https://github.com/Elnaz66/webscrap Ozturk U, Bozzolan E, Holcombe EA, et al (2022) How climate change and unplanned urban sprawl bring more landslides. Nature 608:262–265. https://doi.org/10.1038/d41586-022-02141-9 Petley D (2012) Global patterns of loss of life from landslides. Geology 40:927–930. https://doi.org/10.1130/G33217.1 Raffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21:1–67 Rossi M, Guzzetti F, Salvati P, et al (2019) A predictive model of societal landslide risk in Italy. Earth-Science Rev 196:102849. https://doi.org/10.1016/j.earscirev.2019.04.021 Sepúlveda SA, Petley DN (2015) Regional trends and controlling factors of fatal landslides in Latin America and the Caribbean. Nat Hazards Earth Syst Sci 15:1821–1833. https://doi.org/10.5194/nhess-15-1821-2015 Spizzichino D, Margottini C, Trigila A, et al (2010) Chapter 9: landslides. Eur Environ Agency Mapp impacts Nat hazards Technol Accid Eur An Overv last Decad EEA Tech Rep 13:81–93 Tanyaş H, van Westen CJ, Allstadt KE, et al (2017) Presentation and Analysis of a Worldwide Database of Earthquake-Induced Landslide Inventories. J Geophys Res Earth Surf 122:1991–2015. https://doi.org/10.1002/2017JF004236 Taylor FE, Malamud BD, Freeborough K, Demeritt D (2015) Enriching Great Britain’s National Landslide Database by searching newspaper archives. Geomorphology 249:52–68. https://doi.org/10.1016/j.geomorph.2015.05.019 Van Den Eeckhaut M, Hervás J (2012) State of the art of national landslide databases in Europe and their potential for assessing landslide susceptibility, hazard and risk. Geomorphology 139–140:545–558. https://doi.org/10.1016/j.geomorph.2011.12.006 van Westen CJ, van Asch TWJ, Soeters R (2006) Landslide hazard and risk zonation - Why is it still so difficult? Bull Eng Geol Environ 65:167–184. 
https://doi.org/10.1007/s10064-005-0023-0 Vargiu E, Urru M (2012) Exploiting web scraping in a collaborative filtering-based approach to web advertising. Artif Intell Res 2:44–54. https://doi.org/10.5430/air.v2n1p44 Varnes DJ (1984) Landslide hazard zonation: a review of principles and practice Yacouby R, Axman D (2020) Probabilistic extension of precision, recall, and f1 score for more thorough evaluation of classification models. In: Proceedings of the first workshop on evaluation and comparison of NLP systems. pp 79–91 Young T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learning based natural language processing [Review Article]. IEEE Comput Intell Mag 13:55–75. https://doi.org/10.1109/MCI.2018.2840738 Zhang S, Li C, Peng J, et al (2023) Fatal landslides in China from 1940 to 2020: occurrences and vulnerabilities. Landslides. https://doi.org/10.1007/s10346-023-02034-6

Supplementary Files

SupplementaryInformation.docx

Editorial history: Editorial decision: Major revisions 22 Oct, 2025; Reviewers agreed at journal 28 Aug, 2025; Reviewers invited by journal 28 Aug, 2025; Editor assigned by journal 27 Aug, 2025; First submitted to journal 26 Aug, 2025.
Author affiliations: Elnaz Najatishendi (corresponding author, ORCID https://orcid.org/0000-0001-7901-5640), Yildiz Technical University; Tolga Görüm, Istanbul Technical University, Eurasia Institute of Earth Sciences; Seçkin Fidan, Ankara University; Fusun Balık Şanlı, Yildiz Technical University.

Published version: https://doi.org/10.1007/s11069-025-07753-8

Figure 1 Landslide distribution and density across Türkiye. The map shows the spatial distribution of web-scraped landslides (white dots) and the corresponding landslide density surface.

Figure 2 The workflow of the automated system for capturing, processing, and mapping landslide events from digital news sources.

Figure 3 Validation metrics (precision, recall, and F1-score) for different temporal tolerance windows (±1, ±2, ±3, ±5, and ±7 days) during 2010–2020.

Figure 4 Temporal distribution of (a) landslide events and (b) fatalities in Türkiye (1997–2024).

Figure 5 Monthly-based (a) spatial distribution of all landslide events, (b) total landslide counts, (c) spatial distribution of fatal landslide events, and (d) landslide death counts. Note: Winter: December–January–February, Spring: March–April–May, Summer: June–July–August, Autumn: September–October–November.

Figure 6 Spatial distribution of landslide events at the regional (a) and provincial (c) levels in Türkiye, together with the corresponding frequency of occurrences (b and d).
7","display":"","copyAsset":false,"role":"figure","size":2035667,"visible":true,"origin":"","legend":"\u003cp\u003eSpatial distribution of landslide fatalities at the regional \u003cstrong\u003e(a)\u003c/strong\u003e and provincial \u003cstrong\u003e(c)\u003c/strong\u003elevels in Türkiye, together with the corresponding frequency of fatalities \u003cstrong\u003e(b and d)\u003c/strong\u003e.\u003c/p\u003e","description":"","filename":"floatimage7.png","url":"https://assets-eu.researchsquare.com/files/rs-7463555/v1/05a33fc311c0a30f15fb301c.png"},{"id":90623784,"identity":"24911cb6-e103-4cba-8fcf-118a9236d1ec","added_by":"auto","created_at":"2025-09-04 22:15:39","extension":"png","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":1163473,"visible":true,"origin":"","legend":"\u003cp\u003eDistribution of landslide triggers. \u003cstrong\u003e(a) \u003c/strong\u003eSpatial distribution of events, and \u003cstrong\u003e(b)\u003c/strong\u003efrequency of events according to reported triggers. N/A (not available) refers to events with unknown triggers.\u003c/p\u003e","description":"","filename":"floatimage8.png","url":"https://assets-eu.researchsquare.com/files/rs-7463555/v1/18f7c17566d2b0a03a66a412.png"},{"id":90623793,"identity":"05d9f05e-7d23-4ac4-a0a5-519bccaf657d","added_by":"auto","created_at":"2025-09-04 22:15:40","extension":"png","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":2668834,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003e(a)\u003c/strong\u003e Spatial distribution of landslide risk categories across Türkiye at the neighborhood level, classified into five ordinal risk groups: Very Low, Low, Moderate, High, and Very High. \u003cstrong\u003e(b)\u003c/strong\u003eBoxplot comparison of standardized log-transformed risk scores across all neighborhoods, and separately for fatal and non-fatal areas. 
\u003cstrong\u003e(c)\u003c/strong\u003eHistogram of normalized risk scores (mean-centered and scaled), overlaid with a normal distribution curve (red) and the mean (blue dashed line). \u003cstrong\u003e(d)\u003c/strong\u003eNumber of neighborhoods for each risk class, illustrating the distribution of risk categories.\u003c/p\u003e","description":"","filename":"floatimage9.png","url":"https://assets-eu.researchsquare.com/files/rs-7463555/v1/8ddf6d6001d6810e342feb18.png"},{"id":99172290,"identity":"19ea2c5b-2eeb-4592-beac-2874e5400d1c","added_by":"auto","created_at":"2025-12-29 16:07:19","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":15170761,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7463555/v1/c748184d-55c9-4a8a-8d7e-0995217828e8.pdf"},{"id":90623771,"identity":"4803ae28-795b-42af-bee4-36c91818786e","added_by":"auto","created_at":"2025-09-04 22:15:39","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":30291,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryInformation.docx","url":"https://assets-eu.researchsquare.com/files/rs-7463555/v1/27a3f5c5ea183a4679bce35e.docx"}],"financialInterests":"","formattedTitle":"Generating Landslide Archive Inventories Using Web Scraping and NLP Techniques for Türkiye","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eLandslides are among the most common natural hazards, resulting in substantial loss of life and substantial economic damage worldwide (Fidan et al., \u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Froude \u0026amp; Petley, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Kirschbaum et al., \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). 
Landslides triggered by rainfall (Emberson et al. \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Ozturk et al. \u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e2022\u003c/span\u003e), earthquakes (G\u0026ouml;rum et al., 2011, 2025; Tanyaş et al., \u003cspan citationid=\"CR68\" class=\"CitationRef\"\u003e2017\u003c/span\u003e), and human activities (Guns and Vanacker \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e2014\u003c/span\u003e; Depicker et al. \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e2021\u003c/span\u003e; Ozturk et al. \u003cspan citationid=\"CR62\" class=\"CitationRef\"\u003e2022\u003c/span\u003e) claim substantial loss of life and socio-economic damage (Froude and Petley \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). As a result, recording landslide events contributes to the prevention of loss of life and property by providing an understanding of their spatial and temporal distribution, as well as identifying the factors that control their formation (G\u0026oacute;mez et al., \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Kirschbaum et al., \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2010\u003c/span\u003e; van Westen et al., \u003cspan citationid=\"CR71\" class=\"CitationRef\"\u003e2006\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eLandslide inventories are a fundamental data source for susceptibility, hazard, and risk analyses (Guzzetti et al. \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2005\u003c/span\u003e, \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2012\u003c/span\u003e; van Westen et al. \u003cspan citationid=\"CR71\" class=\"CitationRef\"\u003e2006\u003c/span\u003e; Rossi et al. \u003cspan citationid=\"CR65\" class=\"CitationRef\"\u003e2019\u003c/span\u003e; Caleca et al. 
\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2025\u003c/span\u003e), as well as the development of early warning systems (Guzzetti et al. \u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Fang et al. \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). Compiled to gain more insight into landslides, inventories are typically classified as archive, historical, event-based, seasonal, or multi-temporal (Guzzetti et al. \u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e2012\u003c/span\u003e). Among these, archive inventories with large spatial and temporal coverage are widely used, compiling information from heterogeneous sources such as newspapers, media archives, and technical or scientific reports (Guzzetti et al. \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e1994\u003c/span\u003e; Herv\u0026aacute;s \u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Klose et al. \u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e2016\u003c/span\u003e). They can record all known landslide events and cover periods of up to hundreds of years at various scales. For example, global efforts have been made to understand the spatial and temporal trends of rainfall-induced (Kirschbaum et al. \u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e2009\u003c/span\u003e, \u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e2012\u003c/span\u003e) and fatal landslides (Petley \u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e2012\u003c/span\u003e; Froude and Petley \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Haque et al. \u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e2019\u003c/span\u003e). 
Also, regional inventories have been compiled using multi-country approaches that combine national records in Europe (Van Den Eeckhaut and Herv\u0026aacute;s \u003cspan citationid=\"CR70\" class=\"CitationRef\"\u003e2012\u003c/span\u003e; Haque et al. \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e2016\u003c/span\u003e), Latin America, and the Caribbean (Sep\u0026uacute;lveda and Petley \u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e2015\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eDespite global and regional inventories having made useful contributions to the understanding of landslide events, they have limitations in comprehensively identifying landslides that exhibit complex spatial and temporal patterns (Kirschbaum et al., \u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e2010\u003c/span\u003e; Petley, \u003cspan citationid=\"CR63\" class=\"CitationRef\"\u003e2012\u003c/span\u003e). Global datasets tend to represent only a fraction of landslide events, focusing on those reported in the international media or fatal landslide events in terms of their impacts and consequences (Spizzichino et al. \u003cspan citationid=\"CR67\" class=\"CitationRef\"\u003e2010\u003c/span\u003e; Sep\u0026uacute;lveda and Petley \u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). This underestimation results in spatial and temporal gaps, causing global inventories to lose their capacity to accurately represent landslides at the national level. For instance, the Global Fatal Landslide Database (GFLD) reported only 53 fatal landslides in T\u0026uuml;rkiye (Froude and Petley \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). During the same period, the Fatal Landslides Database of T\u0026uuml;rkiye (FATALDOT) compiled 191 events (G\u0026ouml;r\u0026uuml;m and Fidan \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2021\u003c/span\u003e). 
Also, while the Global Landslide Catalog (GLC) recorded only 67 rainfall-induced landslides in Italy (Kirschbaum et al., \u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e2015\u003c/span\u003e), the newly developed e-ITALICA catalog increased this number to 6,312 (Brunetti et al. \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). Systematically compiled national inventories provide more consistent and standardized records in terms of spatial and temporal coverage. Several countries, for example, Italy (Guzzetti \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2000\u003c/span\u003e; Calvello and Pecoraro \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Brunetti et al. \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2025\u003c/span\u003e), Colombia (Aristiz\u0026aacute;bal and S\u0026aacute;nchez \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Garcia-Delgado et al. \u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e2022\u003c/span\u003e), the United States (Mirus et al. \u003cspan citationid=\"CR59\" class=\"CitationRef\"\u003e2020\u003c/span\u003e), China (Lin and Wang \u003cspan citationid=\"CR56\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Zhang et al. \u003cspan citationid=\"CR76\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), Germany (Damm \u0026amp; Klose, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e2015\u003c/span\u003e), and T\u0026uuml;rkiye (Fidan and G\u0026ouml;r\u0026uuml;m \u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; G\u0026ouml;r\u0026uuml;m and Fidan 2021), have developed national inventories that provide more detailed and reliable landslide records. 
The use of local language (Sep\u0026uacute;lveda and Petley \u003cspan citationid=\"CR66\" class=\"CitationRef\"\u003e2015\u003c/span\u003e) and national sources enables the capture of events that are not usually found in global inventories. Nevertheless, compiling, integrating, and analyzing such inventories often requires a significant amount of time and effort, making their development both labor-intensive and operationally challenging.\u003c/p\u003e\u003cp\u003eOver the past few years, digital technologies have fundamentally changed how natural hazard data are collected, analyzed, and shared. Specifically, with the widespread use of the internet, a tremendous amount of information concerning natural hazards is created on online platforms such as news portals, government institution websites, social media, and digital archives (Lai et al. \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Avcıoğlu et al. \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). In this context, web scraping, developed as a means of gathering information from the web, offers a faster, cheaper, and less labor-intensive alternative to conventional data collection methods (Cording \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e2011\u003c/span\u003e; Vargiu and Urru \u003cspan citationid=\"CR72\" class=\"CitationRef\"\u003e2012\u003c/span\u003e; Chauhan et al. \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eThe rapid growth of digital content and advances in Natural Language Processing (NLP) (Young et al. \u003cspan citationid=\"CR75\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Kang et al. \u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Raffel et al. 
\u003cspan citationid=\"CR64\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Koltsakis et al. \u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Kumar and Renuka \u003cspan citationid=\"CR52\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) have created new opportunities for transitioning from analog methods to automation in collecting data related to natural hazards. Web-based methods, especially web scraping and crawling, combined with NLP, are increasingly being used to automatically develop natural hazard databases by extracting information from large volumes of text (Avcıoğlu et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2025\u003c/span\u003e; Battistini et al., \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2013\u003c/span\u003e, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Carley et al., \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Goswami et al., \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2018\u003c/span\u003e; Lausch et al., \u003cspan citationid=\"CR54\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). These techniques enable the systematic collection of data on natural hazards by extracting pertinent information, such as location, date, number of deaths and casualties, and triggers, from online news reports, official reports, social media, and public archives (Battistini et al. \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Carley et al. \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Goswami et al. \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e2018\u003c/span\u003e). 
Consequently, these new methods enable the monitoring and recording of natural hazards in near real-time, either to supplement or replace more conventional observational methods.\u003c/p\u003e\u003cp\u003eIn accordance with new developments, an increasing number of studies have adopted web-based techniques to automatically compile landslide inventories from digital media sources. These approaches have proven effective in capturing events that were missed by traditional inventories and in enhancing the spatial and temporal completeness of landslide records. For example, after searching local newspapers, 111 previously unrecognized events have been added to the UK National Landslide Database, and information regarding the impacts of some 90% of the recognized landslides has been compiled (Taylor et al. \u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e2015\u003c/span\u003e). Furthermore, large-scale data mining has demonstrated the ability to significantly expand the scope of events, even in municipalities not previously included in Italy's existing inventories (Franceschini et al. \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). Even though web-based inventories play a supporting role for national databases, manual verification is still very important due to location uncertainties, inconsistent terms, and missing information (Battistini et al. \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Kreuzer and Damm \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2020\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eAlthough it has become possible to detect landslide events from digital news sources automatically, research in this area remains limited in terms of scope and accuracy. 
For instance, existing landslide inventories generally do not provide spatial resolution beyond the provincial level, and only a small fraction are verified through comparative analysis with manually compiled landslide datasets (Avcıoğlu et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2025\u003c/span\u003e; Franceschini et al., \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2022\u003c/span\u003e). In T\u0026uuml;rkiye, where landslides are widespread and deadly, a web-based inventory that includes systematic validation is still not available.\u003c/p\u003e\u003cp\u003eTo address this gap, this study created, validated, and spatially analyzed a national landslide inventory using web scraping methods (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Here, a fully automated framework was developed to detect, map, and analyze fatal and non-fatal landslides in T\u0026uuml;rkiye using online media news sources. The approach captures landslide events by integrating web scraping, NLP, and spatial inference, and assigns them to administrative neighborhoods using geocoding routines when location information is available in the news. The web-based inventory was also validated against a manually compiled national landslide database (G\u0026ouml;r\u0026uuml;m et al., \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) and used to provide a risk estimate at the neighborhood level. By combining near-real-time data compilation with spatial accuracy and validation, this study suggests how automated inventories can go beyond incident detection to provide actionable risk information at the local level.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e"},{"header":"2. 
Materials and methods","content":"\u003cp\u003eWe developed a web-scraping algorithm to automatically detect and analyze landslide events in T\u0026uuml;rkiye from digital media sources (Najatishendi \u003cspan citationid=\"CR61\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). For this purpose, first, URLs were collected from news websites by identifying relevant keywords. During the web scraping process, we examined the HTML structures and extracted key information, including titles, content, and publication dates. We then analyzed the resulting text using natural language processing (NLP) techniques to classify the location, date, number of deaths, and triggering factors. Landslide events were spatially mapped by geocoding place names using Nominatim and Geopy. All extracted information was compiled into a structured dataset and validated against a manually prepared inventory. Finally, we calculated and spatially analyzed a risk estimate that considers landslide probability, exposure, and fatalities (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1. Data collection\u003c/h2\u003e\u003cp\u003eIn this study, we utilized web scraping techniques to identify landslide events in digital news sources. In this context, we analyzed the HTML structures of various news websites in T\u0026uuml;rkiye in detail and developed data extraction strategies suitable for the content presentation style of each source. HTML tags, classes, and ID structures in web pages were considered the main reference points for accurate content extraction.\u003c/p\u003e\u003cp\u003eNevertheless, changes in the HTML structure of websites over time can negatively impact the accuracy of the extraction process. 
In particular, when a style class name is changed or a content block is moved to a different structure, non-updated systems may not recognize the relevant data, resulting in incomplete or inaccurate data collection. To minimize such problems, we structured the scrapers to be as flexible, traceable, and updatable as possible. While such incompatibilities are less common for sites using structured data standards (e.g., Schema.org), a systematic maintenance process became necessary for resources with non-standard HTML layouts.\u003c/p\u003e\u003cp\u003eIn the process of collecting URLs, we used a Google search engine scraper to identify news articles related to landslides. The resulting list of links was processed by a second scraper we developed to extract news headlines, publication dates, keywords, and news bodies. At this stage, we used a list of Turkish keywords to pre-filter the content related to landslides. We also identified their English equivalents for comparison with international practices (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eKeywords (Turkish) used to capture digital news sources on landslides in T\u0026uuml;rkiye. 
Keywords are given with their English equivalents.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"2\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eKeywords in Turkish\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eKeywords in English\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\"\u0026ccedil;amur akıntısı\", \"\u0026ccedil;amur akması\", \"\u0026ccedil;amur hareketi\", \"heyelan\", \"kaya \u0026ccedil;\u0026ouml;kmesi\", \"kaya devrilmesi\", \"kaya d\u0026uuml;şmesi\", \"kaya hareketi\", \"kaya kayması\", \"kaya yuvarlanması\", \"moloz akıntısı\", \"moloz akması\", \"moloz hareketi\", \"toprak \u0026ccedil;\u0026ouml;kmesi\", \"toprak hareketi\", \"toprak kayması\", \"toprak s\u0026uuml;r\u0026uuml;klenmesi\", \"yama\u0026ccedil; kayması\", \"yer \u0026ccedil;\u0026ouml;kmesi\", \"yer hareketleri\", \"yer kayması\", \"zemin \u0026ccedil;\u0026ouml;kmesi\", \"zemin hareketi\", \"zemin kayması\"\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\"mud flow\", \"mudflow (mudslide)\", \"mud movement\", \"landslide\", \"rock collapse\", \"rock overturning\", \"rock fall\", \"rock movement\", \"rock sliding\", \"rock rolling\", \"debris flow\", \"debris flow\", \"debris movement\", \"soil collapse\", \"soil movement\", \"soil sliding\", \"soil dragging\", \"slope sliding\", \"ground subsidence\", \"ground movements\", \"ground sliding\", \"ground collapse\", \"ground movement\", \"ground 
sliding\"\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2. Natural Language Processing (NLP)\u003c/h2\u003e\u003cp\u003eNLP, a branch of artificial intelligence (AI), enables computers to understand, analyze, and produce human language (Jurafsky and Martin 2020). In disaster news analysis, NLP is a core technology that turns unstructured text into structured, analyzable data (Bird et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2009\u003c/span\u003e). We applied NLP techniques to the collected news articles to automatically extract details such as the event location (city, district, village, neighborhood), the number of casualties (dead, injured, missing), the trigger of the landslide (e.g., heavy precipitation, geological instability), the date of the event, and, where an article reports several events, the total number of landslides (Manning et al. \u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e2014\u003c/span\u003e).\u003c/p\u003e\u003cp\u003ePython is widely used in this area because of its simplicity and its extensive NLP and web scraping libraries. BeautifulSoup, Scrapy, and Requests are among the most commonly used tools for extracting and manipulating web content. Once collected, content is passed through a series of processing tasks, including text tokenization, named entity recognition (NER), event discovery, and geocoding (Bird et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2009\u003c/span\u003e; Manning et al. \u003cspan citationid=\"CR58\" class=\"CitationRef\"\u003e2014\u003c/span\u003e).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e2.3. 
Geocoding\u003c/h2\u003e\u003cp\u003eWe used an open-source geocoding approach to automatically obtain geographic coordinates (latitude and longitude) from Turkish address data (Chow et al. \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2016\u003c/span\u003e; Kilic et al. \u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e2023\u003c/span\u003e). For this purpose, we integrated the Nominatim service, which uses the OpenStreetMap (OSM) infrastructure, into the Python environment through the geopy library. During geocoding, we built four address combinations for each record from its neighborhood, village, district, and province components. These combinations were applied in sequence, from the most detailed to the simplest: neighborhood-village-district-province, village-district-province, district-province, and province only. When the most detailed administrative unit could not be resolved, the larger administrative units referenced in the news were used to determine the location (e.g., Battistini et al. \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Froude and Petley \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e2018\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eOur multi-stage query approach enabled location determination even for records with missing or incomplete address information. For each successful match, we added the coordinate information (settlement center) to the relevant record, and in cases where no match was obtained, we left the field blank. We also added retry and delay mechanisms to mitigate connection problems and timeout errors.\u003c/p\u003e\u003cp\u003eWe systematically applied this methodology to a tabular dataset. For each address record, we tested the four combinations in sequence and added latitude-longitude information as new columns to the dataset. 
After the process was complete, we exported the file containing all records enriched with coordinate information as a separate Excel spreadsheet. This approach enabled the capture and processing of large spatial landslide datasets in a low-cost and reproducible manner.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e2.4. Validation of the web-scraped inventory\u003c/h2\u003e\u003cp\u003eTo assess the reliability of the web-based landslide inventory, we performed a systematic validation using a previously manually compiled reference landslide inventory for T\u0026uuml;rkiye (G\u0026ouml;r\u0026uuml;m et al., \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). The validation was performed for 2010\u0026ndash;2020, as older online news reports are often removed from the internet and thus become unavailable for web-scraping approaches. Furthermore, the manual inventory only extends up to 2020, making this period most suitable for a consistent and comprehensive comparison.\u003c/p\u003e\u003cp\u003eDue to uncertainty in spatial accuracy, the validation focused only on temporal consistency. For each landslide event in the web-based inventory, a match was accepted if at least one event in the manual inventory occurred within a window of 7 days before and 7 days after the web-based event date (i.e., \u0026plusmn;\u0026thinsp;7 days) (Taylor et al. \u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e2015\u003c/span\u003e; Battistini et al. \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2017\u003c/span\u003e). The analysis was also run for \u0026plusmn;\u0026thinsp;1-day and \u0026plusmn;\u0026thinsp;2-day time intervals to test sensitivity to this parameter. A window wider than \u0026plusmn;\u0026thinsp;7 days was not used, to avoid matching unrelated events and artificially inflating the agreement between inventories (Battistini et al. 
\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2017\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eWe performed the validation process in three steps. First, we compared each event collected from the web with the manual inventory. If a manual event was recorded within \u0026plusmn;\u0026thinsp;n days of the date of the web-collected event, the web event was counted as a true positive (TP). Next, we defined false positives (FP) as events collected from the web for which no manual record was found within the window. Finally, we classified false negatives (FN) as manual events for which no web-collected event occurred within the window (Lai et al. \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Bhuyan et al. \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eTo quantify the performance of the web-scraped inventory against the manual inventory, we measured three standard performance metrics: precision, recall, and F1-score. Precision is the proportion of web-scraped events that are also found in the manual inventory, reflecting the accuracy of the detected events. Recall is the proportion of events in the manual inventory that were successfully identified by web scraping, reflecting the completeness of detection. The F1-score is the harmonic mean of precision and recall, a single metric that weighs accuracy and completeness equally (Yacouby and Axman \u003cspan citationid=\"CR74\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Lai et al. \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Bhuyan et al. 
\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eThe metrics were computed as follows:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$Precision=\\frac{TP}{TP+FP}\\qquad\\left(1\\right)$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equb\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equb\" name=\"EquationSource\"\u003e\n$$Recall=\\frac{TP}{TP+FN}\\qquad\\left(2\\right)$$\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Equc\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equc\" name=\"EquationSource\"\u003e\n$$F1=2\\times\\frac{Precision\\times Recall}{Precision+Recall}\\qquad\\left(3\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewhere TP, FP, and FN represent true positives, false positives, and false negatives, respectively.\u003c/p\u003e\u003c/div\u003e\u003cdiv 
id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e2.5. Risk estimation\u003c/h2\u003e\u003cp\u003eTo estimate landslide risk at the neighborhood level, we developed a method that integrates landslide probability, population exposure, and recorded fatalities.\u003c/p\u003e\u003cp\u003eIn the first step, we calculated the probability of at least one landslide occurring during the 27 years of the inventory:\u003cdiv id=\"Equd\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equd\" name=\"EquationSource\"\u003e\n$${P}_{L}=\\frac{{Landslides}_{i}}{inventory\\:period}\\qquad\\left(4\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewhere \u003cem\u003eLandslides\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e is the total number of landslides recorded in neighborhood i over the 27-year inventory period.\u003c/p\u003e\u003cp\u003eNext, we computed population exposure (van Westen et al. \u003cspan citationid=\"CR71\" class=\"CitationRef\"\u003e2006\u003c/span\u003e; Corominas et al. \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2014\u003c/span\u003e; Maes et al. \u003cspan citationid=\"CR57\" class=\"CitationRef\"\u003e2017\u003c/span\u003e) as the product of the probability of landslides and the normalized population (Lebakula et al. 
\u003cspan citationid=\"CR55\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) of each neighborhood, based on the area:\u003cdiv id=\"Eque\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Eque\" name=\"EquationSource\"\u003e\n$${Exposure}_{i}={P}_{L}\\times\\left(\\frac{{Population}_{i}}{{Area}_{i}}\\right)\\qquad\\left(5\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewhere \u003cem\u003eExposure\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e represents the population exposure for neighborhood i, \u003cem\u003eP\u003c/em\u003e\u003csub\u003e\u003cem\u003eL\u003c/em\u003e\u003c/sub\u003e is the probability of at least one landslide occurring within 27 years, \u003cem\u003ePopulation\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e is the total population of the neighborhood, and \u003cem\u003eArea\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e is the area of the neighborhood in square kilometers. Area normalization accounts for spatial heterogeneity and prevents overestimation of exposure in large administrative units.\u003c/p\u003e\u003cp\u003eFinally, we estimated the risk of landslides (Varnes \u003cspan citationid=\"CR73\" class=\"CitationRef\"\u003e1984\u003c/span\u003e; Corominas et al. \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e2014\u003c/span\u003e) at the neighborhood level using two different approaches, depending on whether fatalities were recorded in each neighborhood. 
In neighborhoods where landslide-related fatalities (\u003cem\u003eFatalities\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e) were recorded, we defined risk as the ratio of total fatalities to the estimated exposure:\u003cdiv id=\"Equf\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equf\" name=\"EquationSource\"\u003e\n$${Risk}_{i}=\\frac{{Fatalities}_{i}}{{Exposure}_{i}}\\qquad\\left(6\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eOn the other hand, in neighborhoods where no deaths were recorded, we calculated a proxy risk estimate using relative landslide frequency and exposure. 
Here, \u003cem\u003eLandslides\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e denotes the total number of landslides recorded in the neighborhood, whereas \u003cem\u003eTotal\u003c/em\u003e\u003csub\u003e\u003cem\u003eL\u003c/em\u003e\u003c/sub\u003e denotes the total number of landslides recorded in all neighborhoods.\u003cdiv id=\"Equg\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equg\" name=\"EquationSource\"\u003e\n$${Risk}_{i}=\\left(\\frac{{Landslides}_{i}}{{Total}_{L}}\\right)\\times{Exposure}_{i}\\qquad\\left(7\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eWe also applied a median-based normalization to ensure comparability between fatal and non-fatal neighborhoods. We calculated the proxy risk values for all neighborhoods with no recorded fatalities, then calculated the median of these values (\u003cem\u003eMedian\u003c/em\u003e\u003csub\u003e\u003cem\u003eNonfatal\u003c/em\u003e\u003c/sub\u003e). 
Then, we rescaled the risk value of non-fatal neighborhoods by multiplying it by the ratio of the median risk in fatal neighborhoods (\u003cem\u003eMedian\u003c/em\u003e\u003csub\u003e\u003cem\u003eFatal\u003c/em\u003e\u003c/sub\u003e) to \u003cem\u003eMedian\u003c/em\u003e\u003csub\u003e\u003cem\u003eNonfatal\u003c/em\u003e\u003c/sub\u003e:\u003cdiv id=\"Equh\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equh\" name=\"EquationSource\"\u003e\n$${Risk}_{iN}={Risk}_{i}\\times\\left(\\frac{{Median}_{Fatal}}{{Median}_{Nonfatal}}\\right)\\qquad\\left(8\\right)$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThis scaling aligns the two risk distributions and reduces potential bias in non-fatal areas (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003eb). Although landslide risk is calculated using two different formulations based on the presence or absence of fatalities, both outputs are referred to as \u003cem\u003eRisk\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e for uniformity and comprehensibility in the manuscript. We also categorized risk areas as very high, high, moderate, low, and very low risk (van Westen et al. \u003cspan citationid=\"CR71\" class=\"CitationRef\"\u003e2006\u003c/span\u003e).\u003c/p\u003e\u003c/div\u003e"},{"header":"3. Result","content":"\u003cp\u003eUsing our web scraping algorithm, we captured 3051 news articles from across T\u0026uuml;rkiye between 1997 and 2024. We then separated the texts according to event date, location, and content attributes. 
Missing or incorrect province, district, and neighborhood names were corrected according to administrative standards. We applied a filter that included text classification and date-location consistency checks to remove duplicate news articles and those that did not report actual landslide events. Similar records captured within a\u0026thinsp;\u0026plusmn;\u0026thinsp;3-day window at the same location (province, district, neighborhood) were considered duplicates and removed from the dataset. Ultimately, we labeled 1727 news articles as records reporting an actual landslide event.\u003c/p\u003e\u003cp\u003eOur new database provides information on the ID, date, latitude, longitude, location (i.e., region, subregion, province, district, and neighborhood), number of deaths and injuries, and triggering factor (e.g., natural or anthropogenic) for each landslide event scraped from the web. We also classified 212 events, which together caused 478 deaths, as fatal landslides.\u003c/p\u003e\u003cp\u003eIn determining the location of each landslide event, we matched the location names reported in news articles to the corresponding existing settlement centers. In this respect, we located 66.5% of the 1727 landslide events at the neighborhood/village level (n\u0026thinsp;=\u0026thinsp;1149), 30.6% at the district level (n\u0026thinsp;=\u0026thinsp;528), and 2.9% at the province level only (n\u0026thinsp;=\u0026thinsp;50). For the 66.5% located at the neighborhood/village level, location accuracy corresponds to administrative units with a mean planimetric area of 16 km\u0026sup2; and a standard deviation of 26 km\u0026sup2;. For the other 33.5%, the corresponding mean planimetric area reaches up to 1000 km\u0026sup2;.\u003c/p\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.1. 
Quantitative assessment of validation\u003c/h2\u003e\u003cp\u003eWe tested the accuracy of the web-scraped landslide inventory by comparing it with a manually compiled inventory. We assessed temporal agreement between event dates at tolerance levels of \u0026plusmn;\u0026thinsp;1, \u0026plusmn;\u0026thinsp;2, \u0026plusmn;\u0026thinsp;3, \u0026plusmn;\u0026thinsp;5, and \u0026plusmn;\u0026thinsp;7 days. Analysis for the years 2010\u0026ndash;2020 revealed that the choice of temporal tolerance had a substantial influence on agreement measures between the two inventories. Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e provide the detailed results for these temporal tolerance levels.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePerformance metrics for different temporal tolerance windows (2010\u0026ndash;2020)\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"7\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth 
align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTolerance (days)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTP\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eFP\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eFN\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003ePrecision\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eRecall\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003eF1-score\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u0026plusmn;\u0026thinsp;1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e636\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e202\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1455\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.759\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.304\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.434\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u0026plusmn;\u0026thinsp;2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e712\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e126\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1379\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.850\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.341\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.486\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u0026plusmn;\u0026thinsp;3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e757\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e81\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1334\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.903\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.362\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.517\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u0026plusmn;\u0026thinsp;5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e796\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e42\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1295\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.950\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.381\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.544\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u0026plusmn;\u0026thinsp;7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e808\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e30\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e1283\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.964\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.386\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.552\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003ctfoot\u003e\u003ctr\u003e\u003ctd colspan=\"7\"\u003e\u003cb\u003eNote\u003c/b\u003e: True Positives (TP), False Positives (FP), and False Negatives (FN).\u003c/td\u003e\u003c/tr\u003e\u003c/tfoot\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eUsing a\u0026thinsp;\u0026plusmn;\u0026thinsp;1-day window, 636 web-scraped events were matched to manual records (TP), 202 web-scraped events were unmatched (FP), and 1455 manual events were not detected by web scraping (FN). The corresponding precision, recall, and F1-score were 0.759, 0.304, and 0.434, respectively. As the window widened to \u0026plusmn;\u0026thinsp;5 and \u0026plusmn;\u0026thinsp;7 days, these metrics improved notably, with the \u0026plusmn;\u0026thinsp;7-day window providing the highest performance levels (precision\u0026thinsp;=\u0026thinsp;0.964, recall\u0026thinsp;=\u0026thinsp;0.386, F1\u0026thinsp;=\u0026thinsp;0.552, see Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eSince the F1-score is the harmonic mean of precision and recall, low recall places an upper bound on the F1-score. At the \u0026plusmn;\u0026thinsp;7-day window, precision is already close to 0.96, so the only way to increase the F1-score further is to improve recall substantially. 
However, as tolerance widens, additional matching gains exhibit diminishing returns. Hence, the increase in the F1-score remains limited (from 0.486 to 0.552).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e3.2. Temporal distribution\u003c/h2\u003e\u003cp\u003eOver the 27-year period from 1997 to 2024, the annual number of landslides remained low until 2007 and first began to rise in 2008. The annual number exceeded 50 in 2014 and 100 in 2018. In 2024, the number of landslides peaked at 276 (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003ea). Despite annual variations, the number of landslides generally shows an increasing trend. Overall, we documented an average of 64 landslides per year in T\u0026uuml;rkiye between 1997 and 2024.\u003c/p\u003e\u003cp\u003eThe temporal distribution of landslide fatalities is not uniform (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003eb). Until 2008, the annual number of deaths was mostly below 10. Although the number of deaths began to increase in 2008, it remained low between 2009 and 2015. Between 2016 and 2024, more deaths were reported than in previous years. In particular, two sharp increases were recorded in 2016 (n\u0026thinsp;=\u0026thinsp;77) and 2023 (n\u0026thinsp;=\u0026thinsp;63). These peaks and irregular fluctuations in the annual number of deaths are related to individual events causing multiple deaths. For example, single landslide events in 2016 and 2023 killed 44 and 15 people, respectively. 
Over a 27-year period, landslides in T\u0026uuml;rkiye have caused 478 deaths, which is an average of 18 people per year.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e shows the spatial distribution of landslide events on a monthly basis. Between 1997 and 2024, landslide events were concentrated in the winter season (31.2%, n\u0026thinsp;=\u0026thinsp;538). The number of events, which was relatively low at 127 in December, increased to 199 in January and then to 212 in February, reaching the highest frequency (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea and b). However, the winter season is also the period with the lowest rate of deaths \u0026mdash; 14%, n\u0026thinsp;=\u0026thinsp;67 \u0026mdash; (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec and d).\u003c/p\u003e\u003cp\u003eIn spring and summer, which account for 25.8% (n\u0026thinsp;=\u0026thinsp;445) and 24.6% (n\u0026thinsp;=\u0026thinsp;424) of the total number of incidents, respectively (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ea and b), approximately 56% of total deaths were recorded \u0026mdash; spring: n\u0026thinsp;=\u0026thinsp;125, 26.2%; summer: n\u0026thinsp;=\u0026thinsp;142, 29.7% \u0026mdash; (Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003ec and d). Although autumn accounted for only 18.5% of total landslide events (n\u0026thinsp;=\u0026thinsp;320), it was the season with the highest mortality rate (30.1%; n\u0026thinsp;=\u0026thinsp;144).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.3. 
Spatial distribution and triggering factors\u003c/h2\u003e\u003cp\u003eOur web scraping algorithm assigns the administrative unit mentioned in the news content to the settlement center when determining the exact location of landslide events. Therefore, since there is a margin of error (minimum average 16 km\u0026sup2;) equal to the area of the most detailed administrative unit mentioned in the news content, landslides are also grouped regionally and by province. Here, we analyze the spatial distribution of landslides (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e) and fatalities (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e) at the regional and provincial levels.\u003c/p\u003e\u003cp\u003eRegionally, the Black Sea area has the highest concentration of landslides (38.2%, n\u0026thinsp;=\u0026thinsp;659), while the Southeastern Anatolia region has the lowest number (5.8%, n\u0026thinsp;=\u0026thinsp;101). In other regions, landslides occurred at a rate of 18.6% (n\u0026thinsp;=\u0026thinsp;321) in Marmara, 12.3% (n\u0026thinsp;=\u0026thinsp;212) in Eastern Anatolia, 10% (n\u0026thinsp;=\u0026thinsp;172) in the Mediterranean, 9% (n\u0026thinsp;=\u0026thinsp;155) in the Aegean, and 6.2% (n\u0026thinsp;=\u0026thinsp;107) in Central Anatolia (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ea and b).\u003c/p\u003e\u003cp\u003eT\u0026uuml;rkiye has 81 provinces, and at least one landslide event has been recorded in all of them. Istanbul is the province with the highest number of landslide events (n\u0026thinsp;=\u0026thinsp;180) and accounts for 10% of the records. 
Following Istanbul, the provinces of Rize (n\u0026thinsp;=\u0026thinsp;97), Trabzon (n\u0026thinsp;=\u0026thinsp;94), Artvin (n\u0026thinsp;=\u0026thinsp;91), Ordu (n\u0026thinsp;=\u0026thinsp;67), and Zonguldak (n\u0026thinsp;=\u0026thinsp;554) stand out as the provinces with the highest landslide frequency (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003ec and d).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAt the regional level, landslide fatalities are also most frequently observed in the Black Sea region. A total of 210 deaths have been recorded in the Black Sea region, accounting for 44% of T\u0026uuml;rkiye's landslide fatalities (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ea and b). This is followed by the Marmara (12.3%, n\u0026thinsp;=\u0026thinsp;59), Mediterranean (11.7%, n\u0026thinsp;=\u0026thinsp;56), Southeastern Anatolia (10.7%, n\u0026thinsp;=\u0026thinsp;51), Central Anatolia (9.2%, n\u0026thinsp;=\u0026thinsp;44), and Eastern Anatolia (6.9%, n\u0026thinsp;=\u0026thinsp;33) regions. The lowest number of deaths was recorded in the Aegean region (5.2%, n\u0026thinsp;=\u0026thinsp;25).\u003c/p\u003e\u003cp\u003eConsidering landslide fatalities, at least one death has been recorded in 59 of the 81 provinces (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003ec and d). Trabzon has the highest fatality rate (77, 16.1%). Although Istanbul ranks first in terms of the number of landslides (n\u0026thinsp;=\u0026thinsp;180), the number of fatalities (n\u0026thinsp;=\u0026thinsp;29) is relatively low, placing it second. Following these are Rize (n\u0026thinsp;=\u0026thinsp;28), Kastamonu (n\u0026thinsp;=\u0026thinsp;28), Adana (n\u0026thinsp;=\u0026thinsp;24), Adıyaman (n\u0026thinsp;=\u0026thinsp;21), Artvin (n\u0026thinsp;=\u0026thinsp;16), and Sivas (n\u0026thinsp;=\u0026thinsp;15). 
These provinces, where landslide deaths are most concentrated, account for approximately 50% of total deaths.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eOur web scraping algorithm can also identify the triggering factors of landslide events scraped from the web (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e). We grouped the triggering factors of landslides into three categories: (i) natural (e.g., rainfall, snowmelt), (ii) anthropogenic (e.g., construction, infrastructure, mining), and (iii) N/A (not available). Among the classified landslide events, natural triggers accounted for the largest group with 1173 events, followed by anthropogenic triggers with 350 events. The remaining 204 events were assigned to the N/A (not available) category due to the absence of trigger information.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e3.4. Distribution of risk estimates\u003c/h2\u003e\u003cp\u003eTo examine the distribution of landslides across the country in further detail, we estimated and compared landslide risk in neighborhoods nationwide by incorporating landslide events and deaths into a combined index. Risk estimates are mainly concentrated in the low (n\u0026thinsp;=\u0026thinsp;292) and moderate (n\u0026thinsp;=\u0026thinsp;230) categories, while the higher risk categories, high (n\u0026thinsp;=\u0026thinsp;171) and very high (n\u0026thinsp;=\u0026thinsp;122), include relatively fewer neighborhoods (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003ed). The number of very high-risk neighborhoods is highest in Istanbul (n\u0026thinsp;=\u0026thinsp;66), followed by Rize, Zonguldak, Ankara, and Trabzon (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003ea and Table S1). 
Istanbul also contains the largest number of high-risk neighborhoods (n\u0026thinsp;=\u0026thinsp;21), bringing the total number of neighborhoods in the top two risk categories to 87, more than any other province in the country (Table S1).\u003c/p\u003e\u003cp\u003eBeyond their spatial distribution, we analyzed whether the risk values exhibited consistent patterns across fatal and non-fatal neighborhoods. After median rescaling, the standardized risk values were comparably distributed for fatal and non-fatal neighborhoods (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003eb). Although derived from different formulas, both groups had a similar distribution and central tendency in their log values. The standardized distribution of the risk values approximated a normal distribution with slight skewness after scaling and transformation (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003ec).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"4. Discussion","content":"\u003cp\u003eIn this study, we present a landslide inventory obtained automatically from web sources using web scraping and natural language processing (NLP) techniques. The web scraping algorithm identified 3051 news articles between 1997 and 2024. After processing these data, we confirmed 1727 landslide events and identified 478 deaths caused by 212 landslides. High spatial accuracy was achieved by locating 66.5% of landslide events at the neighborhood/village level. Furthermore, the inventory showed moderate agreement (F1\u0026thinsp;=\u0026thinsp;0.552) with the manually compiled national inventory within a\u0026thinsp;\u0026plusmn;\u0026thinsp;7-day tolerance window. 
Our findings show that using web scraping with free and publicly available online sources is an effective way to build archive inventories and provides a reliable data source that complements traditional inventories.\u003c/p\u003e\u003cp\u003ePrevious studies have shown that landslide inventories created using web-based approaches locate events at the district and neighborhood/village level (Battistini et al. \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2013\u003c/span\u003e; Franceschini et al. \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Avcıoğlu et al. \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). However, we identified 66.5% of the landslide events in administrative units with a mean planimetric area of 16 km\u0026sup2;, directly corresponding to the settlement centers of these units. The main reason for this positioning is that landslides are generally reported in the news when they affect human life, settlements, or infrastructure (Moeller \u003cspan citationid=\"CR60\" class=\"CitationRef\"\u003e2006\u003c/span\u003e; Allan et al. \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2013\u003c/span\u003e). Therefore, we consider that locating landslides in relation to the settled area rather than the center of the administrative unit (the center of the polygon) will increase location accuracy. This approach has enabled risk estimation by providing high spatial accuracy at the neighborhood/village level (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eOur web-scraping algorithm also provides more detailed information about landslide reports. 
For example, by extracting fatality information, fatal and non-fatal landslides were distinguished (Figs.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e and \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e), and the triggering factors (natural or anthropogenic) of landslides were identified (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e). Natural language processing (NLP) techniques have made it possible to identify fatal and non-fatal incidents by scanning news content for expressions such as \u0026ldquo;death, injury, loss of life\u0026rdquo; and numerical information. Similarly, using keywords and contexts such as \u0026ldquo;rain, snowmelt, construction, road work, mining,\u0026rdquo; the triggering factors of landslides have been classified as natural or anthropogenic. As a result, our analysis shows that the web-scraping method not only captures data but also contributes to landslide hazard and risk studies by enhancing data attributes.\u003c/p\u003e\u003cp\u003eHowever, since the source of the data is news articles, which events are reported and what details are included in these reports largely depend on media practices. Data collected from the web is valuable for research into natural hazards such as landslides, but it also carries inherent biases (Taylor et al. \u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e2015\u003c/span\u003e; Kreuzer and Damm \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2020\u003c/span\u003e). In particular, the language used in natural hazard news can be ambiguous and have multiple meanings. For example, terms like \u0026ldquo;collapse\u0026rdquo; or \u0026ldquo;slide\u0026rdquo; may not refer to an actual hazard but be used metaphorically (e.g., \u0026ldquo;the collapse of the economy,\u0026rdquo; \u0026ldquo;the team's slide\u0026rdquo;). 
Such metaphorical or incorrect uses can cause problems in NLP and text mining processes, leading models to mistakenly interpret these expressions as disaster events (Bird et al. \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2009\u003c/span\u003e; Jurafsky and Martin 2020).\u003c/p\u003e\u003cp\u003eOur results show that the web scraping method is accurate in terms of the events it detects, but its coverage is limited. High precision values (\u0026plusmn;\u0026thinsp;1 day\u0026thinsp;=\u0026thinsp;0.76, \u0026plusmn;\u0026thinsp;3 days\u0026thinsp;=\u0026thinsp;0.90, and \u0026plusmn;\u0026thinsp;7 days\u0026thinsp;=\u0026thinsp;0.96) support the automatic approach's ability to accurately capture landslides reported in the media (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e and Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). In contrast, the low ratio of events successfully identified by web scraping to events in the manual inventory (recall value, \u0026plusmn;\u0026thinsp;1 day\u0026thinsp;=\u0026thinsp;0.30, \u0026plusmn;\u0026thinsp;3 days\u0026thinsp;=\u0026thinsp;0.36, and \u0026plusmn;\u0026thinsp;7 days\u0026thinsp;=\u0026thinsp;0.39) reflects the limited scope of news sources (Taylor et al. \u003cspan citationid=\"CR69\" class=\"CitationRef\"\u003e2015\u003c/span\u003e) rather than the performance of the method. For example, newspaper articles differ in how they report event details and in the reliability of the information they provide (Lai et al. \u003cspan citationid=\"CR53\" class=\"CitationRef\"\u003e2022\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eComparisons with different inventories highlight the potential and limitations of the automated method more clearly. The 1727 landslides we captured between 1997 and 2024 are very close in number to the 1843 landslides reported in the inventory of Avcıoğlu et al. 
(\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2025\u003c/span\u003e), which covers almost the same period (1997\u0026ndash;2023). This close agreement in event counts suggests that web scraping has high potential for capturing landslide events. Similarly, the number of fatal landslides identified by our algorithm for the period 1997\u0026ndash;2024 (212) matches the number of events reported by G\u0026ouml;r\u0026uuml;m and Fidan (\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e2021\u003c/span\u003e) in their Fatal Landslide Database of T\u0026uuml;rkiye (FATALDOT) inventory for 1997\u0026ndash;2019. The manually compiled FATALDOT inventory, which also includes historical records, reports a total of 389 fatal landslides over a longer period (1929\u0026ndash;2019).\u003c/p\u003e\u003cp\u003eNevertheless, while 2091 landslides were recorded in the inventory used for validation covering the same period by G\u0026ouml;r\u0026uuml;m et al. (\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e2025\u003c/span\u003e), only 838 events could be detected using the web scraping method. This difference shows that the automated method does not record all events completely, while the comparison itself provides a measurable estimate of its error rates.\u003c/p\u003e\u003cp\u003eIn particular, events that are not covered by small-scale or local media (Carrara et al. \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2003\u003c/span\u003e; Guzzetti and Tonelli \u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e2004\u003c/span\u003e) cannot be included in the inventory, which shows that the database created by web scraping is a complementary tool rather than a substitute for manual inventories. Therefore, while web scraping is a powerful tool for systematically capturing current and recent data (e.g., Innocenzi et al. 
\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e2017\u003c/span\u003e; Kreuzer and Damm \u003cspan citationid=\"CR51\" class=\"CitationRef\"\u003e2020\u003c/span\u003e; Franceschini et al. \u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e2022\u003c/span\u003e), manual inventories perform a complementary function in terms of historical landslide events (e.g., Guzzetti \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e2000\u003c/span\u003e; Guzzetti et al. \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e2005\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eOverall, our study shows that web scraping and natural language processing (NLP) techniques offer great potential for natural hazard inventories. However, realizing this potential and producing a nearly complete inventory depends on more than technical improvements to the algorithms. News content related to natural hazards such as landslides, floods, avalanches, fires, and sinkholes must also be presented in a more consistent and structured format according to international standards. For example, reporting basic elements such as the time, location, magnitude, effects, and casualties of an event in a standardized manner will enable these techniques to work more comprehensively and accurately.\u003c/p\u003e\u003cp\u003eIn this context, cooperation between media organizations, disaster management agencies, and the scientific community, supported by international organizations such as the United Nations Educational, Scientific and Cultural Organization (UNESCO), the United Nations Office for Disaster Risk Reduction (UNDRR), and the World Meteorological Organization (WMO), is critical for establishing common standards for disaster reporting. 
Such an initiative will enable faster and more accurate recording of not only landslides but all natural hazards on a global scale, thereby making substantial contributions to scientific research as well as risk management and early warning systems.\u003c/p\u003e"},{"header":"5. Conclusion","content":"\u003cp\u003eThis study demonstrates the potential of automated approaches based on digital media news in producing landslide archive inventories. Our approach, which combines web scraping, natural language processing (NLP), and geocoding techniques, verified 1727 landslides from 3051 news articles covering the period 1997\u0026ndash;2024, determining that 212 of these were fatal and resulted in a total of 478 deaths. This approach not only captures landslide events but also automatically distinguishes fatal from non-fatal cases and extracts the number of casualties, triggering factors, and other key attributes.\u003c/p\u003e\u003cp\u003eOur results have also provided substantial outputs in terms of spatial accuracy. Locating 66.5% of landslide events at the neighborhood/village level has enabled a level of detail that can be used for risk assessments. Comparisons with the manually compiled national inventory show F1 scores ranging from 0.434 to 0.552 within time windows of \u0026plusmn;\u0026thinsp;1 to \u0026plusmn;\u0026thinsp;7 days, which indicates acceptable agreement. These findings suggest that web-based automated approaches can complement and extend traditional inventories.\u003c/p\u003e\u003cp\u003eHowever, the limitations of the method need to be considered. Because the location information in news articles primarily refers to settlement names, the exact locations of landslides are often not captured. The linguistic diversity of news reports, differences in terminology, and the lack of media coverage of events can also create gaps. 
Therefore, it is important to integrate web-based inventories with supporting information sources and improve text mining algorithms in order to provide a more comprehensive representation. Efforts are also needed to improve multilingual data processing capabilities and reduce reporting biases.\u003c/p\u003e\u003cp\u003eIn conclusion, this study shows that the integration of web scraping, natural language processing (NLP), and geocoding techniques can be an alternative to traditional landslide archive inventories, offering low-cost, scalable, and near-real-time updates, especially at the national scale. Future research, algorithm improvements, and initiatives to standardize the reporting of natural hazard news will further develop this potential. Thus, the approach is not only applicable to T\u0026uuml;rkiye but can also be adapted to different languages and applied to various countries or on a global scale. In this context, substantial contributions can be made to landslide risk management, early warning systems, and scientific research.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgments\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study is derived from the first author\u0026rsquo;s doctoral thesis conducted at Yildiz Technical University/T\u0026uuml;rkiye. T.G. acknowledges support from the Scientific and Technological Research Council of T\u0026uuml;rkiye (TUBITAK) under 2247-A National Outstanding Researchers Program grant number 123C512. The authors thank Dr. 
Ugur Ozturk for his support during the risk analysis process.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eContributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eEN is the corresponding author and contributed to the methodology, data collection, investigation, and formal analysis. 
EN wrote the first draft of the paper. SF and TG contributed to the visualization, validation, risk analysis, and writing of the manuscript. TG and FB contributed to the writing, supervision, and review of the manuscript. All authors contributed to the interpretation of the results, editing, and revision of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCode and data availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe data and code for the research can be accessed at https://github.com/Elnaz66/webscrap (Najatishendi 2025).\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eAllan S, Adam B, Carter C (2013) Introduction: The media politics of environmental risk. In: Environmental risks and the media. Routledge, pp 1\u0026ndash;26\u003c/li\u003e\n\u003cli\u003eAristiz\u0026aacute;bal E, S\u0026aacute;nchez O (2020) Spatial and temporal patterns and the socioeconomic impacts of landslides in the tropical and mountainous Colombian Andes. Disasters 44:596\u0026ndash;618. https://doi.org/10.1111/disa.12391\u003c/li\u003e\n\u003cli\u003eAvcıoğlu A, Demir O, G\u0026ouml;r\u0026uuml;m T (2025) An automated approach for developing geohazard inventories using news: Integrating NLP, machine learning, and mapping. 2015:1\u0026ndash;21\u003c/li\u003e\n\u003cli\u003eBattistini A, Rosi A, Segoni S, et al (2017) Validation of landslide hazard models using a semantic engine on online news. Appl Geogr 82:59\u0026ndash;65. https://doi.org/10.1016/j.apgeog.2017.03.003\u003c/li\u003e\n\u003cli\u003eBattistini A, Segoni S, Manzo G, et al (2013) Web data mining for automatic inventory of geohazards at national scale. Appl Geogr 43:147\u0026ndash;158. https://doi.org/10.1016/j.apgeog.2013.06.012\u003c/li\u003e\n\u003cli\u003eBhuyan K, Tanyaş H, Nava L, et al (2023) Generating multi-temporal landslide inventories through a general deep transfer learning strategy using HR EO data. Sci Rep 13:1\u0026ndash;26. 
https://doi.org/10.1038/s41598-022-27352-y\u003c/li\u003e\n\u003cli\u003eBird S, Klein E, Loper E (2009) Natural language processing with Python: analyzing text with the natural language toolkit. \u0026lsquo; O\u0026rsquo;Reilly Media, Inc.\u0026rsquo;\u003c/li\u003e\n\u003cli\u003eBrunetti MT, Gariano SL, Melillo M, et al (2025) An enhanced rainfall-induced landslide catalogue in Italy. Sci data 12:216. https://doi.org/10.1038/s41597-025-04551-6\u003c/li\u003e\n\u003cli\u003eCaleca F, Lombardo L, Steger S, et al (2025) Pan-European Landslide Risk Assessment: From Theory to Practice. Rev Geophys 63:1\u0026ndash;45. https://doi.org/10.1029/2023RG000825\u003c/li\u003e\n\u003cli\u003eCalvello M, Pecoraro G (2018) FraneItalia: a catalog of recent Italian landslides. Geoenvironmental Disasters 5:. https://doi.org/10.1186/s40677-018-0105-5\u003c/li\u003e\n\u003cli\u003eCarley KM, Malik M, Landwehr PM, et al (2016) Crowd sourcing disaster management: The complex nature of Twitter usage in Padang Indonesia. Saf Sci 90:48\u0026ndash;61. https://doi.org/10.1016/j.ssci.2016.04.002\u003c/li\u003e\n\u003cli\u003eCarrara A, Crosta G, Frattini P (2003) Geomorphological and historical data in assessing landslide hazard. Earth Surf Process Landforms 28:1125\u0026ndash;1142. https://doi.org/10.1002/esp.545\u003c/li\u003e\n\u003cli\u003eChauhan R, Negi A, Manchanda M (2023) An Extensive Review on Web Scraping Technique using Python. Proc 2023 2nd Int Conf Augment Intell Sustain Syst ICAISS 2023 1134\u0026ndash;1138. https://doi.org/10.1109/ICAISS58487.2023.10250745\u003c/li\u003e\n\u003cli\u003eChow TE, Dede-Bamfo N, Dahal KR (2016) Geographic disparity of positional errors and matching rate of residential addresses among geocoding solutions. Ann GIS 22:29\u0026ndash;42. https://doi.org/10.1080/19475683.2015.1085437\u003c/li\u003e\n\u003cli\u003eCording PH (2011) Algorithms for Web Scraping. 
104\u003c/li\u003e\n\u003cli\u003eCorominas J, van Westen C, Frattini P, et al (2014) Recommendations for the quantitative analysis of landslide risk. Bull Eng Geol Environ 73:209\u0026ndash;263. https://doi.org/10.1007/s10064-013-0538-8\u003c/li\u003e\n\u003cli\u003eDamm B, Klose M (2015) The landslide database for Germany: Closing the gap at national level. Geomorphology 249:82\u0026ndash;93. https://doi.org/10.1016/j.geomorph.2015.03.021\u003c/li\u003e\n\u003cli\u003eDepicker A, Jacobs L, Mboga N, et al (2021) Historical dynamics of landslide risk from population and forest-cover changes in the Kivu Rift. Nat Sustain. https://doi.org/10.1038/s41893-021-00757-9\u003c/li\u003e\n\u003cli\u003eEmberson R, Kirschbaum D, Amatya P, et al (2022) Insights from the topographic characteristics of a large global catalog of rainfall-induced landslide event inventories. Nat Hazards Earth Syst Sci Discuss 1\u0026ndash;33\u003c/li\u003e\n\u003cli\u003eFang Z, Tanyas H, Gorum T, et al (2023) Speech-recognition in landslide predictive modelling: A case for a next generation early warning system. Environ Model Softw 170:105833. https://doi.org/10.1016/j.envsoft.2023.105833\u003c/li\u003e\n\u003cli\u003eFidan S, G\u0026ouml;r\u0026uuml;m T (2020) T\u0026uuml;rkiye\u0026rsquo;de \u0026Ouml;l\u0026uuml;mc\u0026uuml;l Heyelanların Dağılım Karakteristikleri ve Ulusal \u0026Ouml;l\u0026ccedil;ekte \u0026Ouml;ncelikli Alanların Belirlenmesi. T\u0026uuml;rk Coğrafya Derg 74:123\u0026ndash;134. https://doi.org/10.17211/tcd.731596\u003c/li\u003e\n\u003cli\u003eFidan S, Tanyaş H, Akbaş A, et al (2024) Understanding fatal landslides at global scales: a summary of topographic, climatic, and anthropogenic perspectives. Nat Hazards 120:6437\u0026ndash;6455. https://doi.org/10.1007/s11069-024-06487-3\u003c/li\u003e\n\u003cli\u003eFranceschini R, Rosi A, Catani F, Casagli N (2022) Exploring a landslide inventory created by automated web data mining: the case of Italy. 
Landslides 19:841\u0026ndash;853. https://doi.org/10.1007/s10346-021-01799-y\u003c/li\u003e\n\u003cli\u003eFroude MJ, Petley DN (2018) Global fatal landslide occurrence from 2004 to 2016. Nat Hazards Earth Syst Sci 18:2161\u0026ndash;2181. https://doi.org/10.5194/nhess-18-2161-2018\u003c/li\u003e\n\u003cli\u003eGarcia-Delgado H, Petley DN, Berm\u0026uacute;dez MA, Sep\u0026uacute;lveda SA (2022) Fatal landslides in Colombia (from historical times to 2020) and their socio-economic impacts. Landslides 19:1689\u0026ndash;1716. https://doi.org/10.1007/s10346-022-01870-2\u003c/li\u003e\n\u003cli\u003eG\u0026oacute;mez D, Garc\u0026iacute;a EF, Aristiz\u0026aacute;bal E (2023) Spatial and temporal landslide distributions using global and open landslide databases. Springer Netherlands\u003c/li\u003e\n\u003cli\u003eG\u0026ouml;r\u0026uuml;m T, Bozkurt D, Korup O, et al (2025) The 2023 T\u0026uuml;rkiye-Syria earthquake disaster was exacerbated by an atmospheric river. Commun Earth Environ 6:1\u0026ndash;10. https://doi.org/10.1038/s43247-025-02111-9\u003c/li\u003e\n\u003cli\u003eGorum T, Fan X, van Westen CJ, et al (2011) Distribution pattern of earthquake-induced landslides triggered by the 12 May 2008 Wenchuan earthquake. Geomorphology 133:152\u0026ndash;167. https://doi.org/10.1016/j.geomorph.2010.12.030\u003c/li\u003e\n\u003cli\u003eG\u0026ouml;r\u0026uuml;m T, Fidan S (2021) Spatiotemporal variations of fatal landslides in Turkey. 1691\u0026ndash;1705. https://doi.org/10.1007/s10346-020-01580-7\u003c/li\u003e\n\u003cli\u003eGoswami S, Chakraborty S, Ghosh S, et al (2018) A review on application of data mining techniques to combat natural disasters. Ain Shams Eng J 9:365\u0026ndash;378. https://doi.org/10.1016/j.asej.2016.01.012\u003c/li\u003e\n\u003cli\u003eGuns M, Vanacker V (2014) Shifts in landslide frequency-area distribution after forest conversion in the tropical Andes. Anthropocene 6:75\u0026ndash;85. 
https://doi.org/10.1016/j.ancene.2014.08.001\u003c/li\u003e\n\u003cli\u003eGuzzetti F (2000) Landslide fatalities and the evaluation of landslide risk in Italy. Eng Geol 58:89\u0026ndash;107. https://doi.org/10.1016/S0013-7952(00)00047-8\u003c/li\u003e\n\u003cli\u003eGuzzetti F, Cardinali M, Reichenbach P (1994) The AVI project: A bibliographical and archive inventory of landslides and floods in Italy. Environ Manage 18:623\u0026ndash;633. https://doi.org/10.1007/BF02400865\u003c/li\u003e\n\u003cli\u003eGuzzetti F, Gariano SL, Peruccacci S, et al (2020) Geographical landslide early warning systems. Earth-Science Rev 200:102973. https://doi.org/10.1016/j.earscirev.2019.102973\u003c/li\u003e\n\u003cli\u003eGuzzetti F, Mondini AC, Cardinali M, et al (2012) Landslide inventory maps: New tools for an old problem. Earth-Science Rev 112:42\u0026ndash;66. https://doi.org/10.1016/j.earscirev.2012.02.001\u003c/li\u003e\n\u003cli\u003eGuzzetti F, Stark CP, Salvati P (2005) Evaluation of flood and landslide risk to the population of Italy. Environ Manage 36:15\u0026ndash;36. https://doi.org/10.1007/s00267-003-0257-1\u003c/li\u003e\n\u003cli\u003eGuzzetti F, Tonelli G (2004) Information system on hydrological and geomorphological catastrophes in Italy (SICI): A tool for managing landslide and flood hazards. Nat Hazards Earth Syst Sci 4:213\u0026ndash;232. https://doi.org/10.5194/nhess-4-213-2004\u003c/li\u003e\n\u003cli\u003eHaque U, Blum P, da Silva PF, et al (2016) Fatal landslides in Europe. Landslides 13:1545\u0026ndash;1554. https://doi.org/10.1007/s10346-016-0689-3\u003c/li\u003e\n\u003cli\u003eHaque U, da Silva PF, Devoli G, et al (2019) The human cost of global warming: Deadly landslides and their triggers (1995\u0026ndash;2014). Sci Total Environ 682:673\u0026ndash;684. https://doi.org/10.1016/j.scitotenv.2019.03.415\u003c/li\u003e\n\u003cli\u003eHerv\u0026aacute;s J (2013) Landslide Inventory. In: Bobrowsky PT (ed) Encyclopedia of Natural Hazards. 
Springer Netherlands, Dordrecht, pp 610\u0026ndash;611\u003c/li\u003e\n\u003cli\u003eInnocenzi E, Greggio L, Frattini P, de Amicis M (2017) A Web-Based Inventory of Landslides Occurred in Italy in the Period 2012--2015. In: Mikos M, Tiwari B, Yin Y, Sassa K (eds) Advancing Culture of Living with Landslides. Springer International Publishing, Cham, pp 1127\u0026ndash;1133\u003c/li\u003e\n\u003cli\u003eJurafsky M (2020) Speech and Language Processing An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models Third Edition draft Summary of Contents. vii\u0026ndash;x\u003c/li\u003e\n\u003cli\u003eKang Y, Cai Z, Tan CW, et al (2020) Natural language processing (NLP) in management research: A literature review. J Manag Anal 7:139\u0026ndash;172. https://doi.org/10.1080/23270012.2020.1756939\u003c/li\u003e\n\u003cli\u003eKilic B, Hacar M, G\u0026uuml;lgen F (2023) Effects of reverse geocoding on OpenStreetMap tag quality assessment. Trans GIS 27:1599\u0026ndash;1613. https://doi.org/10.1111/tgis.13089\u003c/li\u003e\n\u003cli\u003eKirschbaum D, Adler R, Adler D, et al (2012) Global Distribution of Extreme Precipitation and High-Impact Landslides in 2010 Relative to Previous Years. J Hydrometeorol 13:1536\u0026ndash;1551. https://doi.org/10.1175/JHM-D-12-02.1\u003c/li\u003e\n\u003cli\u003eKirschbaum D, Stanley T, Zhou Y (2015) Spatial and temporal analysis of a global landslide catalog. Geomorphology 249:4\u0026ndash;15. https://doi.org/10.1016/j.geomorph.2015.03.016\u003c/li\u003e\n\u003cli\u003eKirschbaum DB, Adler R, Hong Y, et al (2010) A global landslide catalog for hazard applications: method, results, and limitations. Nat Hazards 52:561\u0026ndash;575. https://doi.org/10.1007/s11069-009-9401-4\u003c/li\u003e\n\u003cli\u003eKirschbaum DB, Adler R, Hong Y, Lerner-Lam A (2009) Evaluation of a preliminary satellite-based landslide hazard algorithm using global landslide inventories. 
Nat Hazards Earth Syst Sci 9:673\u0026ndash;686. https://doi.org/10.5194/nhess-9-673-2009\u003c/li\u003e\n\u003cli\u003eKlose M, Maurischat P, Damm B (2016) Landslide impacts in Germany: A historical and socioeconomic perspective. Landslides 13:183\u0026ndash;199. https://doi.org/10.1007/s10346-015-0643-9\u003c/li\u003e\n\u003cli\u003eKoltsakis E, Klontzas ME, Karantanas AH (2023) What Is Artificial Intelligence: History and Basic Definitions\u003c/li\u003e\n\u003cli\u003eKreuzer TM, Damm B (2020) Automated digital data acquisition for landslide inventories. Landslides 17:2205\u0026ndash;2215. https://doi.org/10.1007/s10346-020-01431-5\u003c/li\u003e\n\u003cli\u003eKumar LA, Renuka DK (2023) State-of-the-Art Natural Language Processing. Deep Learn Approach Nat Lang Process Speech, Comput Vis 49\u0026ndash;75. https://doi.org/10.1201/9781003348689-3\u003c/li\u003e\n\u003cli\u003eLai K, Porter JR, Amodeo M, et al (2022) A Natural Language Processing Approach to Understanding Context in the Extraction and GeoCoding of Historical Floods, Storms, and Adaptation Measures. Inf Process Manag 59:102735. https://doi.org/10.1016/j.ipm.2021.102735\u003c/li\u003e\n\u003cli\u003eLausch A, Schmidt A, Tischendorf L (2015) Data mining and linked open data - New perspectives for data analysis in environmental research. Ecol Modell 295:5\u0026ndash;17. https://doi.org/10.1016/j.ecolmodel.2014.09.018\u003c/li\u003e\n\u003cli\u003eLebakula V, Epting J, Moehl J, et al (2024) LandScan Silver Edition\u003c/li\u003e\n\u003cli\u003eLin Q, Wang Y (2018) Spatial and temporal analysis of a fatal landslide inventory in China from 1950 to 2016. Landslides 15:2357\u0026ndash;2372. https://doi.org/10.1007/s10346-018-1037-6\u003c/li\u003e\n\u003cli\u003eMaes J, Kervyn M, de Hontheim A, et al (2017) Landslide risk reduction measures: A review of practices and challenges for the tropics. Prog Phys Geogr 41:191\u0026ndash;221. 
https://doi.org/10.1177/0309133316689344\u003c/li\u003e\n\u003cli\u003eManning CD, Bauer J, Finkel J, Bethard SJ (2014) The Stanford CoreNLP Natural Language Processing Toolkit. AclwebOrg 55\u0026ndash;60\u003c/li\u003e\n\u003cli\u003eMirus BB, Jones ES, Baum RL, et al (2020) Landslides across the USA: occurrence, susceptibility, and data limitations. Landslides 17:2271\u0026ndash;2285. https://doi.org/10.1007/s10346-020-01424-4\u003c/li\u003e\n\u003cli\u003eMoeller SD (2006) \u0026lsquo;Regarding the Pain of Others\u0026rsquo;: Media, Bias and the Coverage of International Disasters. J Int Aff 59:173\u0026ndash;XVI\u003c/li\u003e\n\u003cli\u003eNajatishendi E (2025) Automated extraction of landslide events from Turkish news articles (Version 0.1.0) [Software]. https://github.com/Elnaz66/webscrap\u003c/li\u003e\n\u003cli\u003eOzturk U, Bozzolan E, Holcombe EA, et al (2022) How climate change and unplanned urban sprawl bring more landslides. Nature 608:262\u0026ndash;265. https://doi.org/10.1038/d41586-022-02141-9\u003c/li\u003e\n\u003cli\u003ePetley D (2012) Global patterns of loss of life from landslides. Geology 40:927\u0026ndash;930. https://doi.org/10.1130/G33217.1\u003c/li\u003e\n\u003cli\u003eRaffel C, Shazeer N, Roberts A, et al (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21:1\u0026ndash;67\u003c/li\u003e\n\u003cli\u003eRossi M, Guzzetti F, Salvati P, et al (2019) A predictive model of societal landslide risk in Italy. Earth-Science Rev 196:102849. https://doi.org/10.1016/j.earscirev.2019.04.021\u003c/li\u003e\n\u003cli\u003eSep\u0026uacute;lveda SA, Petley DN (2015) Regional trends and controlling factors of fatal landslides in Latin America and the Caribbean. Nat Hazards Earth Syst Sci 15:1821\u0026ndash;1833. https://doi.org/10.5194/nhess-15-1821-2015\u003c/li\u003e\n\u003cli\u003eSpizzichino D, Margottini C, Trigila A, et al (2010) Chapter 9: landslides. 
Eur Environ Agency Mapp impacts Nat hazards Technol Accid Eur An Overv last Decad EEA Tech Rep 13:81\u0026ndash;93\u003c/li\u003e\n\u003cli\u003eTanyaş H, van Westen CJ, Allstadt KE, et al (2017) Presentation and Analysis of a Worldwide Database of Earthquake-Induced Landslide Inventories. J Geophys Res Earth Surf 122:1991\u0026ndash;2015. https://doi.org/10.1002/2017JF004236\u003c/li\u003e\n\u003cli\u003eTaylor FE, Malamud BD, Freeborough K, Demeritt D (2015) Enriching Great Britain\u0026rsquo;s National Landslide Database by searching newspaper archives. Geomorphology 249:52\u0026ndash;68. https://doi.org/10.1016/j.geomorph.2015.05.019\u003c/li\u003e\n\u003cli\u003eVan Den Eeckhaut M, Herv\u0026aacute;s J (2012) State of the art of national landslide databases in Europe and their potential for assessing landslide susceptibility, hazard and risk. Geomorphology 139\u0026ndash;140:545\u0026ndash;558. https://doi.org/10.1016/j.geomorph.2011.12.006\u003c/li\u003e\n\u003cli\u003evan Westen CJ, van Asch TWJ, Soeters R (2006) Landslide hazard and risk zonation - Why is it still so difficult? Bull Eng Geol Environ 65:167\u0026ndash;184. https://doi.org/10.1007/s10064-005-0023-0\u003c/li\u003e\n\u003cli\u003eVargiu E, Urru M (2012) Exploiting web scraping in a collaborative filtering- based approach to web advertising. Artif Intell Res 2:44\u0026ndash;54. https://doi.org/10.5430/air.v2n1p44\u003c/li\u003e\n\u003cli\u003eVarnes DJ (1984) Landslide hazard zonation: a review of principles and practice\u003c/li\u003e\n\u003cli\u003eYacouby R, Axman D (2020) Probabilistic extension of precision, recall, and f1 score for more thorough evaluation of classification models. In: Proceedings of the first workshop on evaluation and comparison of NLP systems. pp 79\u0026ndash;91\u003c/li\u003e\n\u003cli\u003eYoung T, Hazarika D, Poria S, Cambria E (2018) Recent trends in deep learning based natural language processing [Review Article]. IEEE Comput Intell Mag 13:55\u0026ndash;75. 
https://doi.org/10.1109/MCI.2018.2840738\u003c/li\u003e\n\u003cli\u003eZhang S, Li C, Peng J, et al (2023) Fatal landslides in China from 1940 to 2020: occurrences and vulnerabilities. Landslides. https://doi.org/10.1007/s10346-023-02034-6\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"natural-hazards","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"nhaz","sideBox":"Learn more about [Natural Hazards](https://www.springer.com/journal/11069)","snPcode":"11069","submissionUrl":"https://submission.nature.com/new-submission/11069/3","title":"Natural Hazards","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Landslides, Landslide inventory, Web scraping, Natural language processing, Geocoding","lastPublishedDoi":"10.21203/rs.3.rs-7463555/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7463555/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eLandslides are among the most frequent natural hazards that cause significant loss of life and serious economic damage worldwide. Although many inventories have been created using different approaches to understand landslide events, these are rarely updated automatically or in real time. Traditional approaches are laborious processes due to the time and intensive labor requirements, and are limited in terms of timeliness due to reporting delays. To address these challenges, we developed an automated approach that integrates web scraping, natural language processing (NLP), and geocoding techniques using digital media news sources in T\u0026uuml;rkiye to create a landslide archive inventory. Our algorithm verified 1727 of the 3051 news articles it captured between 1997 and 2024 as landslides and identified a total of 478 fatalities in 212 deadly incidents. 66.5% of the landslides captured on the web were located at the neighborhood/village level, providing substantial spatial accuracy. This location accuracy has also enabled risk estimation at the neighborhood/village level. 
Comparison with the manual national inventory shows moderate agreement, with F1 scores ranging from 0.434 to 0.552 in \u0026plusmn;\u0026thinsp;1 and \u0026plusmn;\u0026thinsp;7 daytime windows. The automated method not only captures spatial and temporal patterns of landslides but also extracts key attributes such as location, number of fatalities, and triggering factors (i.e., natural and anthropogenic). Our study demonstrates the potential of web-based automated approaches to complement traditional landslide inventories by providing near-real-time and verifiable data. Finally, we suggest adopting common reporting standards for natural hazard digital newspapers so that this approach can spread globally.\u003c/p\u003e","manuscriptTitle":"Generating Landslide Archive Inventories Using Web Scraping and NLP Techniques for Türkiye","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-04 22:15:34","doi":"10.21203/rs.3.rs-7463555/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Major revisions","date":"2025-10-22T17:14:06+00:00","index":"","fulltext":""},{"type":"reviewerAgreed","content":"","date":"2025-08-28T18:10:21+00:00","index":0,"fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-08-28T15:58:08+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-08-27T07:58:41+00:00","index":"","fulltext":""},{"type":"submitted","content":"Natural Hazards","date":"2025-08-26T09:31:36+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"natural-hazards","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"nhaz","sideBox":"Learn more about [Natural Hazards](https://www.springer.com/journal/11069)","snPcode":"11069","submissionUrl":"https://submission.nature.com/new-submission/11069/3","title":"Natural Hazards","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"9a2423a0-d75d-41ec-9dc5-edfa54504e46","owner":[],"postedDate":"September 4th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-12-29T16:00:36+00:00","versionOfRecord":{"articleIdentity":"rs-7463555","link":"https://doi.org/10.1007/s11069-025-07753-8","journal":{"identity":"natural-hazards","isVorOnly":false,"title":"Natural Hazards"},"publishedOn":"2025-12-26 15:57:37","publishedOnDateReadable":"December 26th, 2025"},"versionCreatedAt":"2025-09-04 22:15:34","video":"","vorDoi":"10.1007/s11069-025-07753-8","vorDoiUrl":"https://doi.org/10.1007/s11069-025-07753-8","workflowStages":[]},"version":"v1","identity":"rs-7463555","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7463555","identity":"rs-7463555","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
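The Discussion reports precision and recall for tolerance windows of ±1, ±3, and ±7 days. The sketch below shows how such scores could be computed. It is a minimal illustration, not the authors' validation code: the matching rule (same place name, dates within the window, greedy one-to-one pairing), the function names, and the toy records are all assumptions made for this example. Because the quoted precision and recall values are rounded, F1 values recomputed from them differ slightly from the reported 0.434 and 0.552.

```python
from datetime import date


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def match_within_window(detected, reference, window_days):
    """Greedy one-to-one matching of (place, date) records.

    A detected record counts as a true positive when an unmatched reference
    record has the same place and a date within +/- window_days.
    This rule is hypothetical; the paper's criteria may differ.
    Returns (precision, recall).
    """
    unmatched = list(reference)
    true_positives = 0
    for place, day in detected:
        for candidate in unmatched:
            ref_place, ref_day = candidate
            if place == ref_place and abs((day - ref_day).days) <= window_days:
                unmatched.remove(candidate)
                true_positives += 1
                break
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Rounded precision/recall pairs quoted in the Discussion, keyed by window (days).
reported = {1: (0.76, 0.30), 3: (0.90, 0.36), 7: (0.96, 0.39)}
f1_by_window = {w: round(f1_score(p, r), 3) for w, (p, r) in reported.items()}

# Invented toy records, used only to exercise the matcher.
detected = [("A", date(2020, 1, 5)), ("B", date(2020, 3, 1))]
reference = [
    ("A", date(2020, 1, 3)),
    ("B", date(2020, 3, 20)),
    ("C", date(2020, 6, 1)),
]
toy_precision, toy_recall = match_within_window(detected, reference, window_days=7)

if __name__ == "__main__":
    print(f1_by_window)
    print(toy_precision, toy_recall)
```

In the toy data, only the record for place "A" falls within the 7-day window, so one of two detected records and one of three reference records are matched. Widening the window can only add matches under this rule, which is consistent with precision and recall both rising from ±1 to ±7 days in the reported results.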