Standardizing Free-Text Data Exemplified by Age and Data-Location Fields in the Immune Epitope Database

doi:10.21203/rs.3.rs-5363542/v1

2024 · doi:10.21203/rs.3.rs-5363542/v1

preprint OA: closed

Research Article

Sebastian Duesing, Jason Bennett, James A. Overton, Randi Vita, and 1 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-5363542/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 22 Mar, 2025 in Journal of Biomedical Semantics. Version 1 posted 12

Abstract

Background

While unstructured data, such as free text, constitutes a large amount of publicly available biomedical data, it is underutilized in automated analyses due to the difficulty of extracting meaning from it. Normalizing free-text data, i.e., removing inessential variance, enables the use of structured vocabularies like ontologies to represent the data and allow for harmonized queries over it.
This paper presents an adaptable tool for free-text normalization and an evaluation of the application of this tool to two different sets of unstructured biomedical data curated from the literature in the Immune Epitope Database (IEDB): age and data-location.

Results

Free-text entries for the database fields for subject age (4095 distinct values) and publication data-location (251,810 distinct values) in the IEDB were analyzed. Normalization was performed in three steps, namely character normalization, word normalization, and phrase normalization, using generalizable rules developed and applied with the tool presented in this manuscript. For the age dataset, in the character stage, the application of 21 rules resulted in 99.97% output validity; in the word stage, the application of 94 rules resulted in 98.06% output validity; and in the phrase stage, the application of 16 rules resulted in 83.81% output validity. For the data-location dataset, in the character stage, the application of 39 rules resulted in 99.99% output validity; in the word stage, the application of 187 rules resulted in 98.46% output validity; and in the phrase stage, the application of 12 rules resulted in 97.95% output validity.

Conclusions

We developed a generalizable approach for normalization of free text as found in database fields with content on a specific topic. Creating and testing the rules took a one-time effort for a given field, and the rules can now be applied to data as it is being curated. The standardization achieved in the two datasets tested significantly reduces variance in the content, which enhances the findability and usability of that data, chiefly by improving search functionality and enabling linkages with formal ontologies.

Keywords: unstructured data, free-text data, data normalization, data standardization, Immune Epitope Database, ontology

Background

Much of the data within and outside the biomedical field is unstructured, with estimates ranging as high as 95% [1]. Unstructured data is commonly underutilized due to the difficulty of automatically extracting meaningful information. In our work on the Immune Epitope Database (IEDB) [2], we have found that unstructured data also lags behind structured data in adherence to FAIR data standards; in a 2018 analysis of the IEDB’s progress towards improved data FAIRness, an area identified for improvement was the findability of unstructured free-text data [3]. Normalizing free-text data, i.e., removing variance that does not affect meaning from text, enables linkages between the unstructured data and structured vocabularies like ontologies, which can significantly improve the FAIRness and usability of the data. This paper presents a novel repository of Python scripts for free-text data normalization and an evaluation of the application of these scripts to two different sets of biomedical data from the IEDB, an age dataset and a data-location dataset.

Variance, a term that this paper uses to refer to differences in representations of information that do not change meaning, is a key problem of free-text normalization. Free-text data can contain several different kinds of variance. Character variance (such as differences in diacritic usage, whitespace, or encoding) differentiates data items like “6–8 weeks” and “6–8 weeks”. Word-level variance, which includes misspellings, abbreviations, synonyms, and colloquialisms, differentiates data items like “6–8 weeks” and “6–8 wks”. Phrase-level variance includes the ways that one idea can be expressed with different permutations of words, and it differentiates data items like “6–8 weeks” and “6 to 8 weeks”.
The data items “6–8 weeks”, “6–8 weeks”, “6–8 wks”, and “6 to 8 weeks” all mean the same thing, but in their unstandardized free-text forms, they are all parsed as distinct. The aim of free-text normalization is to ensure that data items that mean the same thing look the same way. The extent to which each of those three types of variance might exist in a particular dataset is highly dependent on the nature of the data. To be broadly applicable to free-text datasets of all sorts, a free-text normalization tool must be able to address all three types of variance in a way that is flexible enough to account for different datasets’ unique normalization needs. There is a robust history of development of automated tools for addressing some types of variance, such as spell-check technologies, but there are comparatively few holistic tools designed to normalize dataset variance at the character, word, and phrase levels. To that end, we created the free-text normalization tool ADP, which stands for Adaptable, user-Dependent, and Precise. In this paper, we examine the application of ADP to two free-text datasets from the IEDB: the age dataset and the data-location dataset, both of which were accessed using SQL queries. The age dataset records the ages of subjects in investigations archived in the IEDB. It contains 7,151 total unique organism-age pairs (e.g., age: “6–8 weeks old”, organism name: “Mus musculus C57BL/6”), meaning some age values are duplicated in that dataset because they occur with multiple organisms; there are 4,095 unique age value strings. Strings in the age dataset typically contained one piece of information per string, and where list-like strings were present, they were legitimate lists ostensibly linked to studies that investigated subjects at multiple specific ages, e.g., the data item “21, 27 and 36 weeks”. The data-location dataset records the manuscript locations in which certain data are found. 
It contains 251,810 unique data-location strings, such as “Cited reference [PMID: 16472860]”. In contrast with the age dataset, many strings in this dataset contained several individually valid data locations in a single line, such as “Data set S1 and S11 and Figs. 1, 2, 3, and 4”.

Table 1 Example Age & Data-Location Data Items

| Age Dataset | Data-Location Dataset |
|---|---|
| 6 to 8 weeks | Figures 2, 3, 4, 5, 6, S4, S5, S7, Tables 2, 3, 4 and 5 |
| Adults (pregnant) | PDB: 5EC1, 5EC2, 5EBW, 5EBL, 5EBM |
| Mean age of 32.2 years with a range from 18 to 49 years | Richardson et al. Virol 1986;155:508–523 [PMID: 3788062] |
| 18–22 months or 4–6 months | pg. 1410 and J. Virol. 61:1358–1367 |

Methods

ADP, which is available on GitHub [4], is a normalization tool that is not fully automated: it enables a user to create standardization rules and apply them to datasets. The ADP normalization scripts are written in Python version 3.10. The core normalization scripts import the libraries os, re, and sys from the Python Standard Library and the third-party library editdistance (imported as ed). ADP is open-source software licensed under GNU GPL-3.0.

ADP’s three core normalization scripts (char_normalizer.py, word_normalizer.py, and phrase_normalizer.py) address the three types of variance outlined in the introduction: character-, word-, and phrase-level variance. At the character and word stages, ADP also logs a Levenshtein distance score for each data item to indicate the extent of the changes made in that stage. ADP uses a script (calculate_metrics.py) to pull relevant metrics from the normalized output files and generate figures using the Python libraries ast, math, matplotlib.pyplot (imported as plt), pandas (imported as pd), seaborn (imported as sns), and warnings.

ADP Text Normalization Workflow

Action Decision-Based Normalization of Characters and Words

While standardizing character variance can be as simple as selecting acceptable special characters and determining case-sensitivity of the data, standardizing word-level variance involves identifying and correcting misspellings in free-text data, a process which is well known to be “cumbersome” [5]. Normalization tools must also be able to handle “non-standard words,” including numbers, acronyms, and other abbreviations [6]. Some existing word normalization tools overcorrect and have higher rates of “unresolved errors,” or incorrectly spelled words that the tool swaps with a context-incorrect word; others tend to undercorrect, e.g., by failing to recognize “cant” as a misspelling of “can’t” [5]. ADP uses an iterative character and word normalization process designed to prioritize accuracy of outputs.

The character and word normalization scripts share a similar rule-building workflow. When one of these two scripts is run on a dataset for the first time, it identifies distinct text units (characters or words) and creates a review file to be used for normalization rule-setting. For ADP’s purposes, a word is a sequence of characters delineated by one of several common separators, like hyphens, spaces, or punctuation, or by the start or end of a string. The review file is a TSV containing one row for each distinct character or word found in the file. Lowercase letters, digits, and a small number of basic punctuation characters are treated as valid for character normalization and do not receive rows. The review file has columns for the character or word, its context (i.e., the data item strings in which that character or word was found), and a count of its occurrences. It also has four action columns with the headings “replace_with”, “remove”, “invalidate”, and “allow”.
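
The review-file step described above can be illustrated with a minimal sketch for the character stage. This is not ADP's code: the function names, the column name "character", the " | " context separator, and the exact set of characters treated as valid by default are assumptions made for illustration; only the four action column headings come from the text.

```python
import csv
import io
import string
from collections import Counter, defaultdict

# Assumption: characters accepted without review. ADP's actual set may differ.
DEFAULT_VALID = set(string.ascii_lowercase + string.digits + " .,")

# Action column headings as named in the text.
ACTION_COLUMNS = ["replace_with", "remove", "invalidate", "allow"]


def build_char_review(items, valid=DEFAULT_VALID):
    """Return one review row per distinct character that is not in `valid`."""
    counts = Counter()
    contexts = defaultdict(list)
    for item in items:
        for ch in item:
            if ch in valid:
                continue
            counts[ch] += 1
            if item not in contexts[ch]:
                contexts[ch].append(item)
    rows = []
    for ch, n in counts.most_common():
        row = {"character": ch, "context": " | ".join(contexts[ch]), "count": n}
        # Action columns start blank; the user fills one in to set a rule.
        row.update({col: "" for col in ACTION_COLUMNS})
        rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize review rows as TSV text."""
    fields = ["character", "context", "count"] + ACTION_COLUMNS
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


review = build_char_review(["mean age = 30 years", "abstract & p. 664", "age = 6 weeks"])
print(to_tsv(review))
```

Sorting rows by occurrence count puts the most frequent unreviewed characters first, which matches the paper's strategy of prioritizing rules for high-occurrence text units.
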
Entering text in one of the action columns (which we refer to as “making an action decision”) sets a rule for the behavior of the script concerning the character or word in that row during future runs of the script. Table 2 describes how entering text in one of the action columns modifies the behavior of the script.

Table 2 Action Decisions

| Action Column | Function |
|---|---|
| replace_with | This character or word is replaced with the text that is entered in this column. |
| remove | This character or word is removed from the data items in which it occurs. |
| invalidate | This character or word remains as-is, and data items containing this character or word will fail validation. |
| allow | This character or word remains as-is, and this character or word is considered an accepted text unit for validation. |

Every time the script is rerun, it moves any review file rows in which an action decision has been made to a reference file, which serves as a bank of rules for the behavior of the script. Tables 3 and 4 contain examples of the rules applied to these datasets at the character and word stages. To see all normalization rules applied at the character and word stages, please refer to the reference files in the ADP repository [4].

Table 3 Sample Character Normalization Rules & Applications to Data Items

| Dataset | Char. | Occurrences | Example string | Rule | Post-normalization string |
|---|---|---|---|---|---|
| age | = | 65 | “mean age = 30 years” | Allow | “mean age = 30 years” |
| age | – | 31 | “20–67 years” | Replace with: - | “20-67 years” |
| data-loc | & | 53 | “Abstract & p. 664” | Replace with: and | “abstract and p. 664” |
| data-loc | € | 10 | “Figure 1 and Fig. 1â€”figure supplement 1 and PDB 6HD8” | Invalidate | Invalid, not normalized |

Table 4 Sample Word Normalization Rules & Applications to Data Items

| Dataset | Word | Occurrences | Example string | Rule | Post-normalization string |
|---|---|---|---|---|---|
| age | old | 710 | “6–10 week old” | Remove | “6–10 week” |
| age | wk | 57 | “8–10 wk” | Replace with: week | “8–10 week” |
| data-loc | fig | 285 | “Figs. 1 and 2” | Replace with: figure | “figure 1 and 2” |
| data-loc | file | 148 | “additional file 1” | Allow | “additional file 1” |

Following the transfer of rows with new action decisions from the review file to the reference file, the script runs its normalization functions, applying the rules based on the user’s action decisions to the dataset, and it checks for any new text units that do not have a line in either the review file or the reference file. See Fig. 1 for a visual representation of how this process works during the character normalization stage.

In the character normalization stage, data items pass validation if, in the second reference check (as shown in the diagram), only allowed characters are found in the string; otherwise, validation fails. Only data items that pass character-level validation are normalized in the word normalization stage. Data items pass word-level validation if, in the second reference check, only allowed words are found in the string; otherwise, validation fails.

Pattern-Based Normalization of Phrases

ADP phrase normalization uses a process of matching phrase structures to user-defined patterns. This process begins in the word normalization stage. In the word review and reference TSV, there is an additional “category” column. Adding text to this column in the row of a particular word asserts the category to which that word belongs; e.g., in rows for the words “week”, “month”, and “year”, the category has been set to “unit” in the word reference TSV for the age dataset. When the phrase normalization script is called, it divides the data item into individual words as was done for the word normalization phase. The script tracks the word’s place in the string and any delimiters (including punctuation, whitespace, and the start or end of a string) on either side of the word. Then, it searches the word reference file to see if a category has been assigned to the word; if not, it categorizes the word as “unknown”.
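
The word-splitting and category lookup just described might look roughly like the following sketch. The category table, the separator set in the regular expression, and the regex-based detection of numbers are all assumptions for illustration; ADP may handle numbers and separators differently.

```python
import re

# Hypothetical excerpt of a word reference file: word -> category.
CATEGORIES = {
    "week": "unit",
    "month": "unit",
    "year": "unit",
    "mean": "statistical",
    "median": "statistical",
}

# Assumption: a word is a run of characters between whitespace, hyphens,
# or a few punctuation marks (or the start or end of the string).
WORD_RE = re.compile(r"[^\s\-,;:()=]+")
NUMBER_RE = re.compile(r"\d+(\.\d+)?")


def categorize(item, categories=CATEGORIES):
    """Return (index, word, category, left_delimiter, right_delimiter) per word.

    An empty delimiter string stands for the start or end of the data item.
    """
    matches = list(WORD_RE.finditer(item))
    tokens = []
    for i, m in enumerate(matches):
        left_start = matches[i - 1].end() if i > 0 else 0
        right_end = matches[i + 1].start() if i + 1 < len(matches) else len(item)
        word = m.group()
        if word in categories:
            category = categories[word]
        elif NUMBER_RE.fullmatch(word):
            category = "number"  # assumption: numbers are detected by pattern
        else:
            category = "unknown"
        tokens.append((i, word, category, item[left_start:m.start()], item[m.end():right_end]))
    return tokens


for token in categorize("6 to 8-week"):
    print(token)
```

Keeping the delimiters alongside each word preserves enough information to tell “8-week” apart from “8 week” if a later rule needs it.
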
The script produces a string that uses a simple grammar to indicate the categories of each word and their position in the string, e.g., the age datum “6 week mean” is parsed as “[number(0)][unit(1)][statistical(2)]”. The phrase categorization string is stored in a dedicated column in the phrase normalization output file to enable the user to determine which phrase structures occur the most frequently in a dataset and develop normalization rules accordingly. Like the character and word normalization phases, the phrase normalization phase depends on the user to create rules for distinct phrase structures. A dataset’s phrase-type ruleset (found in age_phrase_types.tsv and data_loc_phrase_types.tsv) establishes a name for a pattern, indicates whether or not it is a valid pattern (e.g., in the age dataset, a data item consisting of a number and a unit is valid, but a number by itself is not, as being unitless makes its meaning uncertain), and sets a rule for how phrases that match that pattern should be formatted. See Table 5 for examples. The categorization string, e.g., [number(0)][unit(1)][statistical(2)] (extracted from “6 week mean”), is matched to a pattern—in this case, the pattern called “statistical”—which matches to the structures of data items that provide a mean or median age value. In the “standard_form” column in the phrase-type ruleset, the user can specify how data items matching a pattern should be formatted. In the case of “6 week mean”, the standard form is represented as “[2]: [0] [1]”, in which the numbers in brackets refer to the indices from the categorization string, and how they should be arranged within the standard form string. 
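
The categorization string and the standard-form substitution described above can be reproduced in a few lines. This is a sketch of the idea rather than ADP's implementation, and both function names are hypothetical.

```python
import re


def categorization_string(categories):
    """Build the category grammar string, e.g. '[number(0)][unit(1)]'."""
    return "".join(f"[{category}({index})]" for index, category in enumerate(categories))


def apply_standard_form(words, standard_form):
    """Replace each '[i]' placeholder in standard_form with words[i]."""
    return re.sub(r"\[(\d+)\]", lambda m: words[int(m.group(1))], standard_form)


words = ["6", "week", "mean"]
categories = ["number", "unit", "statistical"]

print(categorization_string(categories))            # [number(0)][unit(1)][statistical(2)]
print(apply_standard_form(words, "[2]: [0] [1]"))   # mean: 6 week
```

Because the standard form refers to words only by index, the same template works for any data item whose categorization string matches the pattern, regardless of the specific numbers or units involved.
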
The phrase normalization script generates a blank phrase-type ruleset file if none exists, but if one exists, it checks each data item’s categorization string against any patterns in the file and applies the pattern in the “standard_form” column if applicable by inserting words where their indices are placed in the standard form string. Through this process, “6 week mean” is rearranged to match the standard form string “[2]: [0] [1]”, so the output for that data item is “mean: 6 week”. This workflow ensures that data items with diverse structures, like “6 week mean” and “mean = 6 week”, take on a single standard phrase structure, like “mean: 6 week”. The specific structure we chose for data items of this type is arbitrary; the crucial part is the ability to quickly modify diversely expressed data items into one standard style.

Table 5 below contains sample rows from both datasets’ phrase type tables as examples of the rules applied to these datasets. To see all normalization rules applied at the phrase stage, please refer to the phrase type files in the ADP repository [4].

Table 5 Sample Phrase Normalization Rules

| Dataset | Pattern name | Pattern | Standard form | Example matched phrases | Example normalized phrases |
|---|---|---|---|---|---|
| age | range | [number(0)] [range_indicator(1)] [number(2)] [unit(3)] | [0]-[2] [3] | “6 to 8-week”, “44.9 to 74.1 year”, “36 to 68.2 year” | “6-8 week”, “44.9-74.1 year”, “36-68.2 year” |
| age | statistical | [statistical(0)] [number(1)] [unit(2)] | [0]: [1] [2] | “mean 29.8 year”, “mean: 30 year”, “median : 7.5 year” | “mean: 29.8 year”, “mean: 30 year”, “median: 7.5 year” |
| data-loc | pdb id | [pdb(0)] [pdb_id(1)] | [0] [1] | “pdb 1mfd”, “pdb 1rzj”, “pdb 1rzk” | “pdb 1mfd”, “pdb 1rzj”, “pdb 1rzk” |
| data-loc | loc number | [location(0)] [number(1)] | [0] [1] | “page 11782”, “information 9”, “data 1” | “page 11782”, “information 9”, “data 1” |

Only data items passing validation at the character and word stages are normalized at the phrase stage.
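
The stage gating described above, together with the four action decisions from Table 2, can be sketched as a small pipeline. The rule dictionaries below are hypothetical excerpts modeled on Tables 3 and 4, the word splitting is simplified to whitespace and hyphens, and numbers are assumed to be accepted without a rule; none of this is ADP's actual code.

```python
import re
import string

# Assumption: characters accepted without review.
BASE_CHARS = set(string.ascii_lowercase + string.digits + " .,-")
NUMBER_RE = re.compile(r"\d+(\.\d+)?")

# Hypothetical reference-file excerpts: text unit -> (action, value).
CHAR_RULES = {
    "–": ("replace_with", "-"),
    "&": ("replace_with", "and"),
    "=": ("allow", ""),
    "€": ("invalidate", ""),
}
WORD_RULES = {
    "old": ("remove", ""),
    "wk": ("replace_with", "week"),
    "weeks": ("replace_with", "week"),
    "week": ("allow", ""),
}


def normalize_chars(item):
    """Apply character rules; return (text, passed_validation)."""
    out, valid = [], True
    for ch in item.lower():
        if ch in BASE_CHARS:
            out.append(ch)
            continue
        # A character with no rule is still "in review" and fails validation.
        action, value = CHAR_RULES.get(ch, ("review", ""))
        if action == "replace_with":
            out.append(value)
        elif action != "remove":
            out.append(ch)
            valid = valid and action == "allow"
    return "".join(out), valid


def normalize_words(item):
    """Apply word rules; return (text, passed_validation)."""
    parts = re.split(r"([\s\-]+)", item)  # words at even indices, separators at odd
    valid = True
    for i in range(0, len(parts), 2):
        word = parts[i]
        if word == "" or NUMBER_RE.fullmatch(word):
            continue
        action, value = WORD_RULES.get(word, ("review", ""))
        if action == "replace_with":
            parts[i] = value
        elif action == "remove":
            parts[i] = ""
        else:
            valid = valid and action == "allow"
    return re.sub(r"\s+", " ", "".join(parts)).strip(), valid


def run_pipeline(item):
    """Run the character and word stages, stopping at the first failed validation."""
    text, ok = normalize_chars(item)
    if not ok:
        return {"text": text, "failed_at": "character"}
    text, ok = normalize_words(text)
    if not ok:
        return {"text": text, "failed_at": "word"}
    return {"text": text, "failed_at": None}  # eligible for phrase normalization


print(run_pipeline("6–10 Week old"))
print(run_pipeline("8-10 wks"))
```

In this sketch an item that fails a stage is returned as it stood at that stage, whereas Table 3 suggests ADP leaves invalidated items unnormalized; the sketch is only meant to show the order of checks.
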
Data items pass phrase-level validation if they match a pattern designated as valid in the phrase-type ruleset. Otherwise, validation fails. The phrase normalization and validation processes are visualized in the flowchart below.

Measuring String Change During Normalization

The ADP normalization code imports the package editdistance to measure the Levenshtein distance between the inputs and outputs in the character and word normalization stages. The normalized output files have dedicated columns for distance scores comparing the character-normalized string against the original and the word-normalized string against the character-normalized string. Because the phrase normalization stage often involves changes in word order, Levenshtein distance ceases to be a sensible measure of continuity between input and output at the phrase normalization stage.

Modular Normalization & Accessory Stages

The ADP normalization process is designed to be modular; because it is split into discrete processes for character, word, and phrase normalization, it is possible to plug in accessory stages to address dataset-specific normalization needs that are not easily handled within the pre-defined stages. The data-location dataset, for instance, implements an accessory stage to split list-like data items into individual strings for data location.

Data-Location Splitting

Because the data-location dataset contained list-like data items in which several distinct data locations were included in a single data item (e.g., the real data item “Fig. 2 A,B,C, Fig. 6.”), phrase normalization would be much more difficult without splitting list-like inputs into multiple items that could then be normalized independently. The splitting script functions as a pre-phrase-normalization stage for the data-location dataset; it creates multiple rows from list-like data items, transforming the single data item “Fig. 2 A,B,C, Fig. 6.” into a set of segments including “figure 2 a”, “figure 2 b”, “figure 2 c”, and “figure 6”. Each segment is separated into a distinct row, which is assigned a post-splitting index and an original index so that segments can be both tracked individually and traced back to the list-like data items from which they were originally split. When phrase normalization is applied to the data-location dataset, because the segments have been split into their own rows, they are treated as distinct phrases, allowing all of the “figure x” example segments above to match to a single pattern, rather than needing dedicated patterns to match to each list-like permutation.

Sample Normalized Data Items

Table 6 contains sample data items from the age and data-location datasets. The columns represent the progression of these data items through the normalization process, with changes made by the character, word, and phrase normalization parts of the code represented in those respective columns. Note that for the data-location dataset, the list-like phrase-normalized strings are split into individual TSV rows for each data item in the list, e.g., the single input data item “Fig. 2 A,B,C, Fig. 6.” becomes four output data items: “figure 2 a”, “figure 2 b”, “figure 2 c”, and “figure 6”.

Table 6 Sample Data Items at Each Stage

| Dataset | Before Normalization | Character Normalized | Word Normalized | Phrase Normalized |
|---|---|---|---|---|
| Age | Six week old | six week old | 6 week | 6 week |
| Age | 6–8 week | 6 to 8-week old | 6 to 8-week | 6–8 week |
| Age | Median age 6.3 years | median age 6.3 years | median 6.3 year | median: 6.3 year |
| Data-Location | Additional File 4, Tables 1 and 2 | additional file 4, Tables 1 and 2 | additional file 4, Table 1 and 2 | ['additional file 4', 'Table 1', 'Table 2'] |
| Data-Location | Figure 2 A,B,C, Fig. 6. | Figure 2 a,b,c, Fig. 6 | figure 2 a,b,c, Fig. 6 | ['figure 2 a', 'figure 2 b', 'figure 2 c', 'figure 6'] |
| Data-Location | Figure 2 A,B, Suppl Fig. 2 | figure 2 a,b, suppl Fig. 2 | figure 2 a,b, supplemental Fig. 2 | ['figure 2 a', 'figure 2 b', 'supplemental Fig. 2'] |

Results

Applying ADP’s normalization scripts to the IEDB age and data-location datasets demonstrates that ADP can significantly improve the overall standardization of a dataset.

User Action Efficiency

ADP is a tool for the development and implementation of standardization rules. Accordingly, the thoroughness with which a user makes action decisions (in the character and word stages) and builds phrase type patterns (in the phrase stage) determines the overall success of ADP at standardizing a dataset. The data presented in this manuscript is the result of a non-exhaustive approach to both datasets in which rule-setting for particularly common characters, words, and phrases was prioritized, to represent a practical and realistic normalization outcome.

Table 7 provides an overview of the extent of the normalization rule-setting done for each dataset. The “items in review” counts reflect the number of characters or words for which action decisions were not made at the time of manuscript submission. The “items in reference” counts reflect the number of characters or words for which action decisions were made. The “phrase-type patterns” counts reflect the number of user-generated patterns against which phrases are matched to determine their validity, and “valid phrase-type patterns” reflect how many of the defined patterns are specified as valid phrases.

Table 7 Number of Action Decisions by Dataset

| | Age Dataset | Data-Location Dataset |
|---|---|---|
| Characters in review | 1 | 7 |
| Words in review | 84 | 1160 |
| Characters in reference | 21 | 39 |
| Words in reference | 94 | 5780 counting mass-allowed Protein Data Bank (PDB) identifiers, otherwise 186 (footnote 3) |
| Phrase-type patterns | 16 | 12 |
| Valid phrase-type patterns | 9 | 11 |

The results presented in this manuscript are accordingly the results of a fairly conservative rule-setting effort intended to prioritize the creation of rules targeting high-occurrence characters, words, and phrase patterns. More comprehensive normalization and higher validity rates at each stage could be achieved by targeting increasingly lower-frequency characters, words, and phrases. Ultimately, reasonable stopping points will vary for each dataset; making action decisions and creating phrase patterns for increasingly infrequent characters, words, and phrases offers diminishing returns in overall dataset standardization.

Validity Rates by Dataset and Stage

ADP validates data items at each stage. In the character stage, data items pass validation if they contain only characters that have been marked as allowed. Data items pass validation at the word stage if they contain only words that have been marked as allowed. In the phrase stage, data items pass validation if they match to a pattern designated as valid. The word and phrase stages only attempt to normalize data items that have passed validation in the previous stage(s). The Validation Results by Dataset and Stage figures below show the rates of validity achieved with the aforementioned non-exhaustive rule-setting approach.

In the character stage, validity rates for both datasets are above 99%. These character validation results were achieved following 21 action decisions for the age dataset and 39 action decisions for the data-location dataset in the character normalization stage (see Table 7). In the word stage, validity rates for both datasets are above 98%.
These word validation results were achieved following 94 action decisions for the age dataset and 187 action decisions (footnote 4) for the data-location dataset in the word normalization stage (see Table 7).

The age dataset’s validity rate at the phrase stage is significantly lower than that of the data-location dataset: 83.8% of data items pass phrase validation in the age dataset, while 97.9% of data items pass phrase validation in the data-location dataset. This is the result of a relatively large number of data items that match invalid patterns. In particular, for the age dataset, numerical exact values (e.g., “7”) and ranges without units (e.g., “8–10”) are designated as invalid phrase types because that age dataset contains ages expressed in hours, days, weeks, months, and years, so any number-unit combination is theoretically possible; without a unit, numerical age values are practically meaningless. As is recorded in the phrase-normalized age dataset file, of the 1019 data items that failed phrase validation, only 105 (1.47% of all data items) failed because they did not match any pattern; all the rest failed because they matched a pattern designated as invalid. These phrase validation results were achieved by matching against 16 phrase-type patterns for the age dataset and 12 patterns for the data-location dataset (see Table 7).

A relatively low number of user action decisions is evidently sufficient to produce very high rates of validity in at least these two free-text datasets. Notably, in both the character and word stages, reaching similar results (> 99% validity in the character stage and > 98% validity in the word stage) in the two datasets required only about twice as many action decisions in the data-location dataset as in the age dataset, even though the former dataset is more than 35 times longer than the latter.

Extent of Change to Data Items

In the character and word stages, the values in the Levenshtein distance score columns (see Measuring String Change During Normalization above) serve as indicators of the extent to which strings are modified during the normalization process. Figures 4 and 5 show the frequency distributions of Levenshtein distance scores by dataset and stage. Note that the word stage figures for both datasets use a logarithmic scale for clarity.

For the age dataset, Levenshtein distance score frequency graphs show that most data items receive little modification during the character and word stages. The notable spike at a score of 1 in the word stage results from the abundance of age data items with plural units that were normalized to singular; the score of 1 frequently represents the removal of an “s” from “years”, “months”, or “weeks.”

In the data-location dataset, the uniform nature of much of the dataset (namely the > 200,000 lines of HLA Ligand Atlas URLs) produces other spikes in the character stage Levenshtein distance frequency chart. The spike at 9 is one such case. Of the 57,517 data-location data items with a Levenshtein distance score of 9 at the character stage, 91% (52,522) are HLA Ligand Atlas URLs whose paths contain a string of 9 uppercase letters (e.g., “https://hla-ligand-atlas.org/peptide/AAAAAQSVY”). The URLs resolve in the same way with lowercase and uppercase letters in that path; the example URL is functionally equivalent to “https://hla-ligand-atlas.org/peptide/aaaaaqsvy”, so normalizing to lowercase does not result in any lost meaning. Levenshtein distance scores at the word stage cluster strongly around 0 for the data-location dataset, a reflection of the fact that a large portion of the dataset, namely the URLs, received no word normalization.
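
The two spikes discussed above can be checked by hand with a small edit-distance function. ADP uses the editdistance package for this; the pure-Python function below is a stand-in written for illustration so that the example has no third-party dependency.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


url = "https://hla-ligand-atlas.org/peptide/AAAAAQSVY"
print(levenshtein(url, url.lower()))          # 9: one substitution per uppercase letter
print(levenshtein("6-8 weeks", "6-8 week"))   # 1: removal of the plural "s"
```
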

Figure 5: Levenshtein Distance Scores by Stage, Data-Location Dataset

Levenshtein distance ceases to be a useful metric at the phrase stage, at which it is often desirable to make significant changes to the overall structure of the data item. Straightforward and benign changes like alterations in word order produce high Levenshtein distances. Accordingly, Levenshtein distance scores are not tracked at the phrase stage.

Data-Location Phrase Splitting and Phrase-Part Validity

Because the data-location dataset included a high number of list-like inputs made up of several individual data locations, the data items in that dataset were put through a splitter script that divided list-like data items so that each output datum referenced exactly one data location (see Data-Location Splitting above). For this dataset, we calculate additional relevant metrics. Split phrase count (listed in the split_phrase_count column) refers to the total number of outputs split out of an original input data item; e.g., the input item “Table 1 and Fig. 1”, which is split into the data items “Table 1” and “figure 1”, has a split phrase count of 2. Validity rate is the number of valid output data items divided by the split phrase count. A validity rate of 1 means that every output data item that derives from a particular input data item is valid, while a validity rate of 0 means that none of those output data items are valid. Split phrase count and validity rate (along with all other analytics, like Levenshtein distance scores) are recorded in the phrase-normalized output file exactly once for each input data item so that means and frequency distributions of those metrics are not skewed by the row-count increase that occurs during phrase splitting.

As is evident in Fig. 6, the large number of HLA Ligand Atlas URLs in the dataset concentrates both the split phrase count and validity rate around 1, as the URLs are all unsplit and valid.
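
The two splitting metrics defined above reduce to simple arithmetic per input data item. In the sketch below, the function name and the boolean-list representation of segment validity are assumptions for illustration, and the sample validity flags are invented rather than taken from the dataset.

```python
from statistics import mean


def split_metrics(segment_validity):
    """Return (split_phrase_count, validity_rate) for one input data item.

    segment_validity holds one boolean per segment split from that item.
    """
    count = len(segment_validity)
    rate = sum(segment_validity) / count if count else 0.0
    return count, rate


# One entry per ORIGINAL input item, so that items which split into many
# rows do not dominate the averages.
per_input = [
    split_metrics([True]),                       # e.g. an unsplit, valid URL
    split_metrics([True, True]),                 # e.g. two valid segments
    split_metrics([True, True, True, False]),    # four segments, one invalid
]

print(per_input[2])                              # (4, 0.75)
mean_count = mean(count for count, _ in per_input)
mean_rate = mean(rate for _, rate in per_input)
print(mean_count, mean_rate)
```

Recording the metrics once per input item, as the paper describes, is what makes the means comparable across items with very different split counts.
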
Including URLs, the mean split phrase count is 1.24 (standard deviation 0.92), and the mean phrase validity rate is 1.00 (standard deviation 0.06). It is noteworthy that the data items with high split phrase counts tend towards high validity rates. It appears that those data items tend to be simple and orderly lists, such as the data item “Figures 1, 2, 3, 4, Supplementary Figs. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13”, which has a split phrase count of 16 and a validity rate of 1.0. Such data items are simpler to split, and their split outputs are individually simpler and more readily matchable to basic phrase patterns than the less uniform lists that occur towards the middle of the split phrase count range, such as “Table 4 and Figs. 1 and 2 and Supporting Information S2 Figure” (split phrase count 4, validity rate 0.75).

When examining only non-URL data items, strong clustering around a validity rate of 1 remains, but with a more obvious spread of split phrase count values, as is evident in Fig. 7. Excluding URLs, the mean split phrase count is 3.19 (standard deviation 1.91), and the mean phrase validity rate is 0.96 (standard deviation 0.17). The implementation of data-location phrase splitting achieves very high rates of validity even among the complicated minority made up of non-URL data items.

Discussion

Measuring Normalization Empirically

The ADP toolset provides several metrics by which a user can measure the extent to which ADP normalization modifies the data, including Levenshtein distance scoring and validation pass/fail rates. These metrics are intended to approximate the degree to which the ADP normalization code improved the overall normality of the data without losing the original string’s meaning. However, empirically evaluating the success of the normalization process as a whole remains difficult due to the lack of a clear universal metric for dataset normalization.
A useful future direction would be to establish an empirical way to measure degrees of standardization in unstructured datasets; ideally, such a metric would allow comparisons between free-text datasets’ spelling accuracy, adherence to grammar, and stylistic consistency.

Evaluating the Utility of ADP

While the age and data-location datasets are distinct enough in size, content, and style to make a case for the flexibility of the ADP rule-setting framework for normalization, its use on these two datasets is not sufficient to demonstrate that ADP is a useful tool for a truly wide range of free-text datasets. Further experimentation with other free-text datasets will be necessary to ensure that ADP normalization is adaptable enough to be used with a diverse range of free-text datasets.

Developing frameworks for testing the accuracy of ADP’s outputs compared to other normalization methods is an active priority. ADP’s user-dependence is a design feature that was implemented specifically because we hypothesize that it will result in higher precision of normalization results compared to predictive normalization tools, which can struggle with certain context-specific normalization decisions, like handling instances of “cant” occurring as a synonym of “slang” rather than a misspelling of “can’t”, that humans can make quickly and accurately [5]. Future testing will likely include evaluating how effectively ADP normalization preserves the meaning of data items throughout the normalization process compared to analogous normalization tools.

Some recent tools for free-text standardization make use of large language models to perform standardization tasks. One such tool is CleanAgent, which uses an LLM agent to identify the types of data (e.g., phone number, email address, date) in each column of a CSV, write and run Python code to standardize each column’s data based on its type, and interact with the user throughout the standardization process [7].
CleanAgent’s “hands-off” approach to data standardization is intended to ensure ease of use and efficiency, as CleanAgent is designed to enable “data scientists to input their standardization requirements in one instance” [7]. At the time of writing, the authors were unable to run CleanAgent on either the age or data-location datasets. We have reached out to the developers of CleanAgent about a recurring error message; we aim to do a direct comparison between the free-text normalization outcomes from CleanAgent and ADP in the future. The following analysis of CleanAgent’s utility is accordingly based solely on the contents of the CleanAgent demonstration publication [7].

CleanAgent’s column type identification processes appear to be somewhat limited. As shown in its 2024 demonstration publication, CleanAgent’s column type identifier attributes the datatype “address” to the “Name” column of its test CSV, which stands out as a potential misidentification; the dataset used in that demonstration is not included with that paper, nor does it appear to be listed in the GitHub repository for CleanAgent [8], so we were unable to verify whether that is indeed a valid datatype for that column. Furthermore, as shown in that demonstration publication, CleanAgent did not identify a datatype for the columns named “AGE” and “weight__” [7], suggesting that CleanAgent may struggle to standardize columns of data that, unlike email addresses or phone numbers, lack a well-defined standard form.

CleanAgent’s LLM-based normalization appears to be an efficient solution to the problem of standardizing fairly simple data items (e.g., dates or contact information), and it is designed to accomplish normalization with very little work on the user’s part. ADP, by comparison, requires more of the user’s time and effort but handles more complex and widely-varying data items.

LLM-based normalization tools are limited by the accuracy of the LLMs on which they depend.
Output accuracy is a vitally important feature of standardization processes for biomedical datasets. Presumably, LLM accuracy will continue to improve, but in the meantime, there remains a need for high-precision standardization tools for use on datasets that require extremely high accuracy on free-text fields that are often far messier than comparatively simple “date” or “phone number” fields. ADP is designed for this niche. The authors intend to conduct further empirical evaluation of the accuracy of ADP and comparable LLM-based normalization tools.

Productive Value of Results

These datasets contain all distinct age and data-location values recorded in the IEDB, but many of the individual values in these datasets represent thousands of instances of that value. As a result, the effect of normalizing these datasets is multiplied. The IEDB records more than 18 million total age values and more than 21 million total data-location values [4], so applying normalization to these data values in the IEDB will accordingly produce significant improvements to the findability and usability of millions of lines of data.

Improving Data Findability

Using the ADP normalization toolkit, we normalized the age and data-location free-text datasets from the IEDB, two datasets with very different content and normalization needs, in a way that renders the data in these datasets searchable, findable, and ontologizable as they were not before. Standardizing the text in these datasets will enable the forthcoming implementation of dedicated search tools for these datasets in the IEDB. For instance, IEDB users could query for data from experiments on mice less than 28 days old and receive results within that range including ages originally expressed with varying formats and units (e.g., ‘10–20 days old’, ‘2 days’, ‘24 h’, ‘1–3 wks’).
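As a rough illustration of why such a query depends on standardized text, the sketch below converts a few age strings to ranges in days and filters them against a 28-day limit. It does not describe the IEDB's search tools or ADP's output format: the unit table, the regular expression, and the names `DAYS_PER_UNIT`, `age_range_in_days`, and `is_younger_than` are all assumptions made for this example, and the “6–8 weeks” item is added only to show an exclusion.

```python
import re
from typing import Optional, Tuple

# Assumed unit-to-days conversions covering only the forms used below.
DAYS_PER_UNIT = {"h": 1 / 24, "day": 1, "days": 1,
                 "wk": 7, "wks": 7, "week": 7, "weeks": 7}

# A number, an optional "–" or "-" range end, then a unit word.
AGE_PATTERN = re.compile(
    r"^(\d+(?:\.\d+)?)(?:\s*[–-]\s*(\d+(?:\.\d+)?))?\s*([a-z]+)"
)


def age_range_in_days(text: str) -> Optional[Tuple[float, float]]:
    """Return (low, high) in days, or None if the string is not understood."""
    match = AGE_PATTERN.match(text.strip().lower())
    if match is None or match.group(3) not in DAYS_PER_UNIT:
        return None
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    factor = DAYS_PER_UNIT[match.group(3)]
    return low * factor, high * factor


def is_younger_than(text: str, limit_days: float) -> bool:
    """True when the whole age range falls below the limit."""
    days = age_range_in_days(text)
    return days is not None and days[1] < limit_days


ages = ["10–20 days old", "2 days", "24 h", "1–3 wks", "6–8 weeks"]
under_28_days = [age for age in ages if is_younger_than(age, 28)]
print(under_28_days)
```

Strings that the pattern cannot interpret return None and are left out of the results, which is the conservative choice for a search filter.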
Similarly, by ontologizing categorical age data items like ‘juvenile’, ‘calf’, ‘foal’, ‘piglet’, or ‘child’, we can enable searches for pre-adult life stages across species.

The FAIR data principles identify searchability (principle F4) as a critical aspect of data findability, so improving the IEDB’s search functionalities is core to the IEDB’s effort to improve its overall data FAIRness [9]. The data-location dataset in particular was identified as a promising candidate for work to improve the IEDB’s FAIRness in a 2018 analysis of the IEDB’s adherence to the FAIR standards [3]. Accordingly, the normalization performed on the data-location dataset using ADP completes that long-standing goal and demonstrates the IEDB’s ongoing commitment to improving data FAIRness in immunology.

Enabling Ontologization of Free-Text Data

By standardizing the characters, words, and phrase structures in free-text datasets, ADP makes it easier to ontologize those datasets. Several prior publications have illustrated the benefits of linkages between IEDB data and formal ontologies [2,3,10,11]. Already, many IEDB data fields are mapped to terms from a wide range of ontologies, such as the “Organism” field being mapped to NCBI Taxonomy [12] terms and the “Evidence Code” field being mapped to Evidence Ontology [13] terms [14]. By standardizing the terms in use in the age and data-location datasets, ADP normalization is an effective step towards ontologizing the data in these fields. In particular, promising next steps include the ontologization of units in the age dataset via the Units Ontology [15] and of document parts via the Information Artifact Ontology and Ontology for Biomedical Investigations [16]. Should ADP prove effective on other free-text datasets within and beyond the IEDB, it will make it possible to reap the benefits of ontologization from large amounts of previously underutilized biomedical data.
Conclusions

While further testing is necessary to validate ADP normalization on other datasets, preliminary evaluations of its application to the age and data-location datasets suggest that ADP normalization can produce high rates of output validity in diverse free-text datasets following a relatively low number of user action decisions. The Immune Epitope Database (IEDB) has made significant efforts over the past several years to improve its adherence to FAIR data standards through improvements to findability and interoperability of its data. Creating linkages with formal ontologies is a pillar of the IEDB’s efforts to improve interoperability, but these efforts have been concentrated on standardized datasets. The ability to standardize free-text datasets would enable further FAIRness and more effective utilization of the vast quantities of free-text data in the IEDB. Our preliminary results are promising indications that ADP normalization can standardize free-text datasets efficiently and accurately.

Declarations

Ethics Approval and Consent to Participate
Not applicable.

Consent for Publication
Not applicable.

Competing Interests
The authors declare that they have no competing interests.

Funding
Research reported in this publication was supported by the National Institutes of Health contract 75N93019C00001 and grant U24CA248138.

Author Contribution
All authors contributed to the conception of this project. S.D. and J.B. designed and developed the ADP software, collected resulting data, and drafted this manuscript. B.P. and J.A.O. advised on the software design and data collection process. R.V. advised on datasets to target for normalization, assisted in collection of input data, and provided substantial feedback on the software design. All authors read and approved the final manuscript.

Acknowledgments
We wish to acknowledge the entire IEDB and CEDAR development and curation team.
Data Availability
All code and data discussed in this manuscript are available in the following GitHub repository: https://github.com/sebastianduesing/adp

References

1. Gandomi A, Haider M. Beyond the hype: Big data concepts, methods, and analytics. Int J Inf Manag. 2015 Apr 1;35(2):137–44.
2. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, et al. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2019 Jan 8;47(Database issue):D339–43.
3. Vita R, Overton JA, Mungall CJ, Sette A, Peters B. FAIR principles and the IEDB: short-term improvements and a long-term vision of OBO-foundry mediated machine-actionable interoperability. Database. 2018 Jan 1;2018:bax105.
4. Duesing S. sebastianduesing/adp [Internet]. 2024 [cited 2024 Jul 1]. Available from: https://github.com/sebastianduesing/adp
5. Clark E, Araki K. Text Normalization in Social Media: Progress, Problems and Applications for a Pre-Processing System of Casual English. Procedia - Soc Behav Sci. 2011 Jan 1;27:2–11.
6. Sproat R, Black AW, Chen S, Kumar S, Ostendorf M, Richards CD. Normalization of non-standard words. Comput Speech Lang. 2001 Jul 1;15(3):287–333.
7. Qi D, Wang J. CleanAgent: Automating Data Standardization with LLM-based Agents [Internet]. arXiv; 2024 [cited 2024 Sep 30]. Available from: http://arxiv.org/abs/2403.08291
8. sfu-db/CleanAgent: This is an experimental demo repository of agent on data cleaning task [Internet]. [cited 2024 Oct 1]. Available from: https://github.com/sfu-db/CleanAgent
9. GO FAIR [Internet]. [cited 2024 Jun 7]. F4: (Meta)data are registered or indexed in a searchable resource. Available from: https://www.go-fair.org/fair-principles/f4-metadata-registered-indexed-searchable-resource/
10. Vita R, Overton JA, Greenbaum JA, Ponomarenko J, Clark JD, Cantrell JR, et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 2015 Jan 28;43(Database issue):D405–12.
11. Vita R, Overton JA, Sette A, Peters B. Better living through ontologies at the Immune Epitope Database. Database J Biol Databases Curation. 2017 Mar 18;2017:bax014.
12. Database Resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2017 Jan 4;45(Database issue):D12–7.
13. Chibucos MC, Mungall CJ, Balakrishnan R, Christie KR, Huntley RP, White O, et al. Standardized description of scientific evidence using the Evidence Ontology (ECO). Database J Biol Databases Curation. 2014;2014:bau075.
14. Vita R, Overton JA, Peters B. Identification of errors in the IEDB using ontologies. Database J Biol Databases Curation. 2018 Feb 22;2018:bay005.
15. Gkoutos GV, Schofield PN, Hoehndorf R. The Units Ontology: a tool for integrating units of measurement in science. Database J Biol Databases Curation. 2012 Oct 5;2012:bas033.
16. Bandrowski A, Brinkman R, Brochhausen M, Brush MH, Bug B, Chibucos MC, et al. The Ontology for Biomedical Investigations. PLoS ONE. 2016 Apr 29;11(4):e0154556.

Footnotes

The reference files can be found at the following paths in the repository:
age/output_files/char_reference.tsv
age/output_files/word_reference.tsv
data_loc/output_files/char_reference.tsv
data_loc/output_files/word_reference.tsv

The phrase type files can be found at the following paths in the repository:
age/input_files/age_phrase_types.tsv
data_loc/input_files/data_loc_phrase_types.tsv

The data-location dataset contained a large number of Protein Data Bank (PDB) identifiers that parsed as distinct words. These IDs, which follow a standard four-character alphanumeric format, were selected using a regular expression and then mass-allowed. There are 186 non-PDB-ID words in the data-location reference file.

186 row-by-row action decisions plus one mass-allow of Protein Data Bank (PDB) identifiers, as described in footnote 1, performed via regular expression selection of the rows containing PDB identifiers.

Additional Declarations
No competing interests reported.
Version 1 posted Editorial decision: Revision requested 11 Dec, 2024 Reviews received at journal 11 Dec, 2024 Reviews received at journal 27 Nov, 2024 Reviews received at journal 26 Nov, 2024 Reviewers agreed at journal 06 Nov, 2024 Reviewers agreed at journal 06 Nov, 2024 Reviewers agreed at journal 06 Nov, 2024 Reviewers agreed at journal 06 Nov, 2024 Reviewers invited by journal 06 Nov, 2024 Editor assigned by journal 04 Nov, 2024 Submission checks completed at journal 03 Nov, 2024 First submitted to journal 30 Oct, 2024
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-5363542","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":375078972,"identity":"490b1158-604e-4772-b7fd-afa0944b189c","order_by":0,"name":"Sebastian Duesing","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAwElEQVRIiWNgGAWjYFCCBCA2YGDgZ2dIOECaFslm0rSAdB0m1lny7TmGD38U3LHbfJjh4QGGint2DYS0GJx5Y2zMY/AsedthkMPOFCcT1iKRYyYNdFWyGUgLY1tCMmGHzcgx//kDqMW4mVgtDDdyzBh4DA7bGTBDtNgR1GFw5lmxNFBLggTIYQlnEhIIO6w9eePHH38O2/O39yR/+FCRYE/YYVCQ2MDAkwCKIyCDSAA0nP0AlDEKRsEoGAWjABUAAKHjQBmTnlfkAAAAAElFTkSuQmCC","orcid":"","institution":"La Jolla Institute For Allergy \u0026 Immunology","correspondingAuthor":true,"prefix":"","firstName":"Sebastian","middleName":"","lastName":"Duesing","suffix":""},{"id":375078975,"identity":"42471a75-fc28-4195-a146-92af19799b87","order_by":1,"name":"Jason Bennett","email":"","orcid":"","institution":"La Jolla Institute For Allergy \u0026 Immunology","correspondingAuthor":false,"prefix":"","firstName":"Jason","middleName":"","lastName":"Bennett","suffix":""},{"id":375078976,"identity":"dbc15fc0-869c-41e9-872f-229d0a0962d4","order_by":2,"name":"James A. 
Overton","email":"","orcid":"","institution":"Knocean Inc","correspondingAuthor":false,"prefix":"","firstName":"James","middleName":"A.","lastName":"Overton","suffix":""},{"id":375078977,"identity":"997d27d6-b2cb-4af2-9277-f1fc5cb3bafc","order_by":3,"name":"Randi Vita","email":"","orcid":"","institution":"La Jolla Institute For Allergy \u0026 Immunology","correspondingAuthor":false,"prefix":"","firstName":"Randi","middleName":"","lastName":"Vita","suffix":""},{"id":375078978,"identity":"7bce7850-03ad-4e8d-bb51-94de050918a5","order_by":4,"name":"Bjoern Peters","email":"","orcid":"","institution":"La Jolla Institute For Allergy \u0026 Immunology","correspondingAuthor":false,"prefix":"","firstName":"Bjoern","middleName":"","lastName":"Peters","suffix":""}],"badges":[],"createdAt":"2024-10-30 21:38:15","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-5363542/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5363542/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1186/s13326-025-00324-7","type":"published","date":"2025-03-22T15:56:57+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":69000268,"identity":"16db8112-4b6a-4ed0-adce-68a0600bc102","added_by":"auto","created_at":"2024-11-14 11:26:40","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":357193,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFlowchart of ADP Character and Word Normalization Processes\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Fig1Flowchart.jpg","url":"https://assets-eu.researchsquare.com/files/rs-5363542/v1/5538047da9dfd0f721ea64fb.jpg"},{"id":69000266,"identity":"6d3207ef-15ea-42fa-bf56-598a18145bbd","added_by":"auto","created_at":"2024-11-14 11:26:40","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":330279,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eFlowchart of ADP Phrase 
Normalization Processes\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Fig2Flowchart.jpg","url":"https://assets-eu.researchsquare.com/files/rs-5363542/v1/e0d99917f816b127845901f1.jpg"},{"id":69000406,"identity":"1d784da7-e2da-44e1-aca9-afda02abc0b7","added_by":"auto","created_at":"2024-11-14 11:34:40","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":208126,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eValidation Results by Dataset and Stage\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Fig3Piecharts.jpg","url":"https://assets-eu.researchsquare.com/files/rs-5363542/v1/a1510ba0d979ebd505f7b92f.jpg"},{"id":69000407,"identity":"266daaee-405a-4014-b478-c10632a3b8cb","added_by":"auto","created_at":"2024-11-14 11:34:40","extension":"jpg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":278507,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eLevenshtein Distance Scores by Stage, Age Dataset\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Fig4BarchartsAge.jpg","url":"https://assets-eu.researchsquare.com/files/rs-5363542/v1/94362840b5b1187cb0b207b6.jpg"},{"id":69000269,"identity":"558d03b6-9e06-4253-8d62-4d047afc4706","added_by":"auto","created_at":"2024-11-14 11:26:40","extension":"jpg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":418138,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eLevenshtein Distance Scores by Stage, Data-Location Dataset\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Fig5BarchartsDL.jpg","url":"https://assets-eu.researchsquare.com/files/rs-5363542/v1/cd13cb58b75f943efed2ba85.jpg"},{"id":79120370,"identity":"1d55159f-9dc0-45bb-915e-7eddbaa04928","added_by":"auto","created_at":"2025-03-24 
16:03:19","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2486601,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5363542/v1/4e15f651-6e4e-4c33-a41a-512f911c6c84.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Standardizing Free-Text Data Exemplified by Age and Data-Location Fields in the Immune Epitope Database","fulltext":[{"header":"Background","content":"\u003cp\u003eA lot of data within and outside the biomedical field is unstructured, with estimates ranging as high as 95% [1]. Unstructured data is commonly underutilized due to the difficulty of automatically extracting meaningful information. In our work on the Immune Epitope Database (IEDB) [2], we have found that the unstructured data also lags behind structured data in adherence to FAIR data standards; in a 2018 analysis of the IEDB\u0026rsquo;s progress towards improved data FAIRness, an area identified for improvement was the findability of unstructured free-text data [3]. Normalizing free-text data, \u003cem\u003ei.e.\u003c/em\u003e, removing variance that does not affect meaning from text, enables linkages between the unstructured data and structured vocabularies like ontologies, which can significantly improve the FAIRness and usability of the data. This paper presents a novel repository of Python scripts for free-text data normalization and an evaluation of the application of these scripts to two different sets of biomedical data from the IEDB, an age dataset and a data-location dataset.\u003c/p\u003e \u003cp\u003eVariance, a term that this paper uses to refer to differences in representations of information that do not change meaning, is a key problem of free-text normalization. Free-text data can contain several different kinds of variance. 
Character variance (such as differences in diacritic usage, whitespace, or encoding) differentiates data items like \u0026ldquo;6\u0026ndash;8 weeks\u0026rdquo; and \u0026ldquo;6\u0026ndash;8 weeks\u0026rdquo;. Word-level variance, which includes misspellings, abbreviations, synonyms, and colloquialisms, differentiates data items like \u0026ldquo;6\u0026ndash;8 weeks\u0026rdquo; and \u0026ldquo;6\u0026ndash;8 wks\u0026rdquo;. Phrase-level variance includes the ways that one idea can be expressed with different permutations of words, and it differentiates data items like \u0026ldquo;6\u0026ndash;8 weeks\u0026rdquo; and \u0026ldquo;6 to 8 weeks\u0026rdquo;. The data items \u0026ldquo;6\u0026ndash;8 weeks\u0026rdquo;, \u0026ldquo;6\u0026ndash;8 weeks\u0026rdquo;, \u0026ldquo;6\u0026ndash;8 wks\u0026rdquo;, and \u0026ldquo;6 to 8 weeks\u0026rdquo; all mean the same thing, but in their unstandardized free-text forms, they are all parsed as distinct. The aim of free-text normalization is to ensure that data items that mean the same thing look the same way. The extent to which each of those three types of variance might exist in a particular dataset is highly dependent on the nature of the data. To be broadly applicable to free-text datasets of all sorts, a free-text normalization tool must be able to address all three types of variance in a way that is flexible enough to account for different datasets\u0026rsquo; unique normalization needs.\u003c/p\u003e \u003cp\u003eThere is a robust history of development of automated tools for addressing some types of variance, such as spell-check technologies, but there are comparatively few holistic tools designed to normalize dataset variance at the character, word, and phrase levels. To that end, we created the free-text normalization tool ADP, which stands for Adaptable, user-Dependent, and Precise. 
In this paper, we examine the application of ADP to two free-text datasets from the IEDB: the age dataset and the data-location dataset, both of which were accessed using SQL queries.\u003c/p\u003e \u003cp\u003eThe age dataset records the ages of subjects in investigations archived in the IEDB. It contains 7,151 total unique organism-age pairs (e.g., age: \u0026ldquo;6\u0026ndash;8 weeks old\u0026rdquo;, organism name: \u0026ldquo;Mus musculus C57BL/6\u0026rdquo;), meaning some age values are duplicated in that dataset because they occur with multiple organisms; there are 4,095 unique age value strings. Strings in the age dataset typically contained one piece of information per string, and where list-like strings were present, they were legitimate lists ostensibly linked to studies that investigated subjects at multiple specific ages, e.g., the data item \u0026ldquo;21, 27 and 36 weeks\u0026rdquo;.\u003c/p\u003e \u003cp\u003eThe data-location dataset records the manuscript locations in which certain data are found. It contains 251,810 unique data-location strings, such as \u0026ldquo;Cited reference [PMID: 16472860]\u0026rdquo;. 
In contrast with the age dataset, many strings in this dataset contained several individually valid data locations in a single line, such as \u0026ldquo;Data set S1 and S11 and Figs.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, and \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e\u0026rdquo;.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eExample Age \u0026amp; Data-Location Data Items\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAge Dataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eData-Location Dataset\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e6 to 8 weeks\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFigures \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, 5, \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e, S4, S5, S7, Tables\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, \u003cspan refid=\"Tab3\" 
class=\"InternalRef\"\u003e3\u003c/span\u003e, \u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e and \u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAdults (pregnant)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePDB: 5EC1, 5EC2, 5EBW, 5EBL, 5EBM\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMean age of 32.2 years with a range from 18 to 49 years\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRichardson et al. Virol 1986;155:508\u0026ndash;523 [PMID: 3788062]\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e18\u0026ndash;22 months or 4\u0026ndash;6 months\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003epg. 1410 and J. Virol. 61:1358\u0026ndash;1367\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e "},{"header":"Methods","content":"\u003cp\u003eADP is a non-fully-automated normalization tool that enables a user to create standardization rules and apply them to datasets, which is available on GitHub [4]. The ADP normalization scripts are written in Python version 3.10. The core normalization scripts import the libraries os, re, and sys from the Python Standard Library and the non-native library editdistance (imported as ed). ADP is open-source software licensed under GNU GPL-3.0.\u003c/p\u003e \u003cp\u003eADP\u0026rsquo;s three core normalization scripts (char_normalizer.py, word_normalizer.py, and phrase_normalizer.py) address the three types of variance outlined in the introduction: character-, word-, and phrase-level variance. 
At the character and word stages, ADP also logs a Levenshtein distance score for each data item to indicate the extent of the changes made in that stage. ADP uses a script (calculate_metrics.py) to pull relevant metrics from the normalized output files and generate figures using the Python libraries ast, math, matplotlib.pyplot (imported as plt), pandas (imported as pd), seaborn (imported as sns), and warnings.\u003c/p\u003e \u003cp\u003eADP Text Normalization Workflow\u003c/p\u003e \u003cp\u003eAction Decision-Based Normalization of Characters and Words\u003c/p\u003e \u003cp\u003eWhile standardizing character variance can be as simple as selecting acceptable special characters and determining case-sensitivity of the data, standardizing word-level variance involves identifying and correcting misspellings in free-text data, a process which is well-known to be \u0026ldquo;cumbersome\u0026rdquo; [5]. Normalization tools must also be able to handle \u0026ldquo;non-standard words,\u0026rdquo; including numbers, acronyms, and other abbreviations [6]. Some existing word normalization tools overcorrect and have higher rates of \u0026ldquo;unresolved errors,\u0026rdquo; or incorrectly-spelled words that the tool swaps with a context-incorrect word; others tend to undercorrect, e.g., by failing to recognize \u0026ldquo;cant\u0026rdquo; as a misspelling of \u0026ldquo;can\u0026rsquo;t\u0026rdquo; [5]. ADP uses an iterative character and word normalization process designed to prioritize accuracy of outputs.\u003c/p\u003e \u003cp\u003eThe character and word normalization scripts share a similar rule-building workflow. 
When one of these two scripts is run on a dataset for the first time, it identifies distinct text units and creates a review file to be used for normalization rule-setting. A text unit is either a character or a word; for ADP\u0026rsquo;s purposes, a word is a sequence of characters delimited by one of several common separators (such as hyphens, spaces, or punctuation) or by the start or end of a string.\u003c/p\u003e \u003cp\u003eThe review file is a TSV containing one row for each distinct character (except lowercase letters, digits, and a small number of basic punctuation characters, which are treated as valid for character normalization) or word found in the file. It has columns for the character or word, its context (i.e., the data item strings in which that character or word was found), and a count of its occurrences. The review file also has four action columns with the headings \u0026ldquo;replace_with\u0026rdquo;, \u0026ldquo;remove\u0026rdquo;, \u0026ldquo;invalidate\u0026rdquo;, and \u0026ldquo;allow\u0026rdquo;. Entering text in one of the action columns (which we refer to as \u0026ldquo;making an action decision\u0026rdquo;) sets a rule for how the script handles the character or word in that row during future runs. 
Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e describes how entering text in one of the action columns modifies the behavior of the script.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eAction Decisions\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAction Column\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFunction\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ereplace_with\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eThis character or word is replaced with the text that is entered in this column.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eremove\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eThis character or word is removed from the data items in which it occurs.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003einvalidate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eThis character or word remains as-is, and data items containing this character or word will fail validation.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eallow\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eThis character or word remains as-is, and this character or word is considered an accepted text unit for validation.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eEvery time the script is rerun, it moves any review file rows in which an action decision has been made to a reference file, which serves as a bank of rules for the behavior of the script.\u003c/p\u003e \u003cp\u003eTables\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e and \u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e contain examples of the rules applied to these datasets at the character and word stages. To see all normalization rules applied at the character and word stages, please refer to the reference files in the ADP repository [4].\u003ca class=\"FNLink\" href=\"#Fn1\" id=\"#FNLinkFn1\"\u003e\u003c/a\u003e\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSample Character Normalization Rules \u0026amp; Applications to Data Items\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv 
align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChar.\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eOccurrences\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eExample string\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eRule\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003ePost-normalization string\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eage\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e=\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e65\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026ldquo;mean age\u0026thinsp;=\u0026thinsp;30 years\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAllow\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;mean age\u0026thinsp;=\u0026thinsp;30 years\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eage\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u0026ndash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e31\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026ldquo;20\u0026ndash;67 years\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e 
\u003cp\u003eReplace with:\u003c/p\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;20-67 years\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003edata-loc\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u0026amp;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e53\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026ldquo;Abstract \u0026amp; p. 664\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eReplace with:\u003c/p\u003e \u003cp\u003eand\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;abstract and p. 664\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003edata-loc\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u0026euro;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026ldquo;Figure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and Fig.\u0026nbsp;1\u0026acirc;\u0026euro;\u0026rdquo;figure supplement 1 and PDB 6HD8\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eInvalidate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003eInvalid, not normalized\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e 
\u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSample Word Normalization Rules \u0026amp; Applications to Data Items\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWord\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eOccurrences\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eExample string\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eRule\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003ePost-normalization string\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eage\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eold\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e710\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026ldquo;6\u0026ndash;10 week old\u0026rdquo;\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eRemove\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;6\u0026ndash;10 week\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eage\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003ewk\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026ldquo;8\u0026ndash;10 wk\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eReplace with:\u003c/p\u003e \u003cp\u003eweek\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;8\u0026ndash;10 week\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003edata-loc\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003efig\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e285\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026ldquo;Figs.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eReplace with:\u003c/p\u003e \u003cp\u003efigure\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;figure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003edata-loc\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003efile\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e148\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026ldquo;additional file 1\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAllow\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;additional file 1\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eAfter transferring rows with new action decisions from the review file to the reference file, the script runs its normalization functions, applying the rules based on the user\u0026rsquo;s action decisions to the dataset, and checks for any new text units that do not have a row in either the review file or the reference file. See Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e for a visual representation of how this process works during the character normalization stage.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn the character normalization stage, a data item passes validation if the second reference check (as shown in the diagram) finds only allowed characters in the string; otherwise, validation fails. Only data items that pass character-level validation are normalized in the word normalization stage. A data item passes word-level validation if the second reference check finds only allowed words in the string; otherwise, validation fails.\u003c/p\u003e \u003cp\u003ePattern-Based Normalization of Phrases\u003c/p\u003e \u003cp\u003eADP phrase normalization matches phrase structures to user-defined patterns. 
This process begins in the word normalization stage. The word review and reference TSVs have an additional \u0026ldquo;category\u0026rdquo; column. Adding text to this column in the row of a particular word asserts the category to which that word belongs, \u003cem\u003ee.g.\u003c/em\u003e, in the word reference TSV for the age dataset, the rows for the words \u0026ldquo;week\u0026rdquo;, \u0026ldquo;month\u0026rdquo;, and \u0026ldquo;year\u0026rdquo; have the category \u0026ldquo;unit\u0026rdquo;.\u003c/p\u003e \u003cp\u003eWhen the phrase normalization script is called, it divides each data item into individual words, as in the word normalization stage. The script tracks each word\u0026rsquo;s place in the string and any delimiters (including punctuation, whitespace, and the start or end of a string) on either side of it. Then, it searches the word reference file to see if a category has been assigned to the word; if not, it categorizes the word as \u0026ldquo;unknown\u0026rdquo;. The script produces a string that uses a simple grammar to indicate the category of each word and its position in the string, e.g., the age datum \u0026ldquo;6 week mean\u0026rdquo; is parsed as \u0026ldquo;[number(0)][unit(1)][statistical(2)]\u0026rdquo;. The phrase categorization string is stored in a dedicated column in the phrase normalization output file so that the user can determine which phrase structures occur most frequently in a dataset and develop normalization rules accordingly.\u003c/p\u003e \u003cp\u003eLike the character and word normalization stages, the phrase normalization stage depends on the user to create rules for distinct phrase structures. 
A dataset\u0026rsquo;s phrase-type ruleset (found in age_phrase_types.tsv and data_loc_phrase_types.tsv) establishes a name for a pattern, indicates whether it is a valid pattern (e.g., in the age dataset, a data item consisting of a number and a unit is valid, but a number by itself is not, because a unitless number has an uncertain meaning), and sets a rule for how phrases that match that pattern should be formatted. See Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e for examples.\u003c/p\u003e \u003cp\u003eThe categorization string, e.g., [number(0)][unit(1)][statistical(2)] (extracted from \u0026ldquo;6 week mean\u0026rdquo;), is matched to a pattern (in this case, the pattern called \u0026ldquo;statistical\u0026rdquo;), which matches the structures of data items that provide a mean or median age value. In the \u0026ldquo;standard_form\u0026rdquo; column in the phrase-type ruleset, the user can specify how data items matching a pattern should be formatted. In the case of \u0026ldquo;6 week mean\u0026rdquo;, the standard form is represented as \u0026ldquo;[2]: [0] [1]\u0026rdquo;, in which the bracketed numbers are the indices from the categorization string, arranged in the order in which the corresponding words should appear in the output.\u003c/p\u003e \u003cp\u003eThe phrase normalization script generates a blank phrase-type ruleset file if none exists. If one exists, the script checks each data item\u0026rsquo;s categorization string against the patterns in the file and, when a pattern matches, applies the rule in the \u0026ldquo;standard_form\u0026rdquo; column by inserting each word at the position of its index in the standard form string. Through this process, \u0026ldquo;6 week mean\u0026rdquo; is rearranged to match the standard form string \u0026ldquo;[2]: [0] [1]\u0026rdquo;, so the output for that data item is \u0026ldquo;mean: 6 week\u0026rdquo;. 
This workflow ensures that data items with diverse structures, like \u0026ldquo;6 week mean\u0026rdquo; and \u0026ldquo;mean\u0026thinsp;=\u0026thinsp;6 week\u0026rdquo;, take on a single standard phrase structure, like \u0026ldquo;mean: 6 week\u0026rdquo;. The specific structure we chose for data items of this type is arbitrary; the crucial part is the ability to quickly modify diversely expressed data items into one standard style.\u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e5\u003c/span\u003e below contains sample rows from both datasets\u0026rsquo; phrase type tables as examples of the rules applied to these datasets. To see all normalization rules applied at the phrase stage, please refer to the phrase type files in the ADP repository [4].\u003ca class=\"FNLink\" href=\"#Fn2\" id=\"#FNLinkFn2\"\u003e\u003c/a\u003e\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSample Phrase Normalization Rules\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003ePattern name\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePattern\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eStandard form\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eExample matched phrases\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eExample normalized phrases\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eage\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003erange\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[number(0)] [range_indicator(1)] [number(2)] [unit(3)]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e[0]-[2] [3]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u0026ldquo;6 to 8-week\u0026rdquo;,\u003c/p\u003e \u003cp\u003e\u0026ldquo;44.9 to 74.1 year\u0026rdquo;,\u003c/p\u003e \u003cp\u003e\u0026ldquo;36 to 68.2 year\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;6-8 week\u0026rdquo;,\u003c/p\u003e \u003cp\u003e\u0026ldquo;44.9-74.1 year\u0026rdquo;,\u003c/p\u003e \u003cp\u003e\u0026ldquo;36-68.2 year\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eage\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003estatistical\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[statistical(0)] [number(1)] [unit(2)]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e[0]: [1] [2]\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u0026ldquo;mean 29.8 year\u0026rdquo;,\u003c/p\u003e \u003cp\u003e\u0026ldquo;mean: 30 year\u0026rdquo;,\u003c/p\u003e \u003cp\u003e\u0026ldquo;median : 7.5 year\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;mean: 29.8 year\u0026rdquo;,\u003c/p\u003e \u003cp\u003e\u0026ldquo;mean: 30 year\u0026rdquo;,\u003c/p\u003e \u003cp\u003e\u0026ldquo;median: 7.5 year\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003edata-loc\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003epdb id\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[pdb(0)] [pdb_id(1)]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e[0] [1]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u0026ldquo;pdb 1mfd\u0026rdquo;, \u0026ldquo;pdb 1rzj\u0026rdquo;, \u0026ldquo;pdb 1rzk\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026ldquo;pdb 1mfd\u0026rdquo;, \u0026ldquo;pdb 1rzj\u0026rdquo;, \u0026ldquo;pdb 1rzk\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003edata-loc\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eloc number\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e[location(0)] [number(1)]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e[0] [1]\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u0026ldquo;page 11782\u0026rdquo;, \u0026ldquo;information 9\u0026rdquo;, \u0026ldquo;data 1\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e 
\u003cp\u003e\u0026ldquo;page 11782\u0026rdquo;, \u0026ldquo;information 9\u0026rdquo;, \u0026ldquo;data 1\u0026rdquo;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eOnly data items passing validation at the character and word stages are normalized at the phrase stage. Data items pass phrase-level validation if they match a pattern designated as valid in the phrase-type ruleset. Otherwise, validation fails. The phrase normalization and validation processes are visualized in the flowchart below.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eMeasuring String Change During Normalization\u003c/p\u003e \u003cp\u003eThe ADP normalization code imports the package editdistance to measure the Levenshtein distance between the inputs and outputs in the character and word normalization stages. The normalized output files have dedicated columns for distance scores comparing the character-normalized string against the original and the word-normalized string against the character-normalized string. Because the phrase normalization stage often changes word order, Levenshtein distance ceases to be a sensible measure of continuity between input and output at that stage.\u003c/p\u003e \u003cp\u003eModular Normalization \u0026amp; Accessory Stages\u003c/p\u003e \u003cp\u003eThe ADP normalization process is designed to be modular; because it is split into discrete processes for character, word, and phrase normalization, it is possible to plug in accessory stages to address dataset-specific normalization needs that are not easily handled within the pre-defined stages. 
The data-location dataset, for instance, uses an accessory stage to split list-like data items into individual data-location strings.\u003c/p\u003e \u003cp\u003eData-Location Splitting\u003c/p\u003e \u003cp\u003eBecause the data-location dataset contained list-like data items in which a single data item included several distinct data locations (e.g., the real data item \u0026ldquo;Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA,B,C, Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e.\u0026rdquo;), phrase normalization would be much more difficult without first splitting list-like inputs into multiple items that can be normalized independently. The splitting script runs as a pre-phrase-normalization stage for the data-location dataset: it creates multiple rows from list-like data items, transforming the single data item \u0026ldquo;Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA,B,C, Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e.\u0026rdquo; into four segments: \u0026ldquo;figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea\u0026rdquo;, \u0026ldquo;figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb\u0026rdquo;, \u0026ldquo;figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec\u0026rdquo;, and \u0026ldquo;figure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e\u0026rdquo;. 
Each segment is placed in its own row and assigned a post-splitting index and an original index, so that segments can be tracked individually and traced back to the list-like data items from which they were split.\u003c/p\u003e \u003cp\u003eWhen phrase normalization is applied to the data-location dataset, the segments are treated as distinct phrases because they have been split into their own rows. This allows all of the \u0026ldquo;figure \u003cem\u003ex\u003c/em\u003e\u0026rdquo; example segments above to match a single pattern, rather than requiring a dedicated pattern for each list-like permutation.\u003c/p\u003e \u003cp\u003eSample Normalized Data Items\u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e6\u003c/span\u003e contains sample data items from the age and data-location datasets. The columns show the progression of these data items through the normalization process, with each column reflecting the changes made by the character, word, or phrase normalization stage, respectively. 
Note that for the data-location dataset, the list-like phrase-normalized strings are split into individual TSV rows for each data item in the list, e.g., the single input data item \u0026ldquo;Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA,B,C, Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e.\u0026rdquo; becomes four output data items: \u0026ldquo;figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea\u0026rdquo;, \u0026ldquo;figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb\u0026rdquo;, \u0026ldquo;figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec\u0026rdquo;, and \u0026ldquo;figure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e\u0026rdquo;.\u003c/p\u003e \u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eSample Data Items at Each Stage\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDataset\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eBefore Normalization\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eCharacter 
Normalized\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eWord Normalized\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003ePhrase Normalized\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAge\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eSix week old\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003esix week old\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e6 week\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e6 week\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAge\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e6\u0026ndash;8 week\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e6 to 8-week old\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e6 to 8-week\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e6\u0026ndash;8 week\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAge\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eMedian age 6.3 years\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003emedian age 6.3 years\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003emedian 6.3 year\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003emedian: 6.3 year\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eData-Location\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003eAdditional File 4, Tables\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and \u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eadditional file 4, Tables \u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and \u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eadditional file 4, Table \u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and \u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e['additional file 4', 'Table \u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e', 'Table \u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e']\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eData-Location\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA,B,C, Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea,b,c, Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003efigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea,b,c, Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c5\"\u003e\u003cp\u003e['figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea', 'figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb', 'figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec', 'figure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e']\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eData-Location\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eA,B, Suppl Fig.\u0026nbsp;2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003efigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea,b, suppl Fig.\u0026nbsp;2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003efigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea,b, supplemental Fig.\u0026nbsp;2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e['figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ea', 'figure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb', 'supplemental Fig.\u0026nbsp;2']\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eUsing ADP\u0026rsquo;s normalization scripts on the IEDB age and data-location datasets demonstrates that it is possible to use ADP to effect significant improvements to the overall standardization of a dataset.\u003c/p\u003e \u003cp\u003eUser Action Efficiency\u003c/p\u003e \u003cp\u003eADP is a tool for the development and implementation of standardization rules. 
Accordingly, the thoroughness with which a user makes action decisions (in the character and word stages) and builds phrase type patterns (in the phrase stage) determines the overall success of ADP at standardizing a dataset. The data presented in this manuscript is the result of a non-exhaustive approach to both datasets in which rule-setting for particularly common characters, words, and phrases was prioritized, to represent a practical and realistic normalization outcome.\u003c/p\u003e \u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e provides an overview of the extent of the normalization rule-setting done for each dataset. The \u0026ldquo;items in review\u0026rdquo; counts reflect the number of characters or words for which action decisions were not made at the time of manuscript submission. The \u0026ldquo;items in reference\u0026rdquo; counts reflect the number of characters or words for which action decisions were made. The \u0026ldquo;phrase-type patterns\u0026rdquo; counts reflect the number of user-generated patterns against which phrases are matched to determine their validity, and \u0026ldquo;valid phrase-type patterns\u0026rdquo; reflect how many of the defined patterns are specified as valid phrases.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eNumber of Action Decisions by Dataset\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e 
\u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAge Dataset\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eData-Location Dataset\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCharacters in review\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWords in review\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e84\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1160\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCharacters in reference\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e39\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWords in reference\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e94\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5780 counting mass-allowed Protein Data Bank (PDB) identifiers, otherwise 186\u003csup\u003e3\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePhrase-type patterns\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e16\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eValid phrase-type patterns\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe results presented in this manuscript are accordingly the results of a fairly conservative rule-setting effort intended to prioritize the creation of rules targeting high-occurrence characters, words, and phrase patterns. More comprehensive normalization and higher validity rates at each stage could be achieved by targeting increasingly lower-frequency characters, words, and phrases. Ultimately, reasonable stopping points will vary for each dataset; making action decisions and creating phrase patterns for increasingly infrequent characters, words, and phrases offers diminishing returns in overall dataset standardization.\u003c/p\u003e \u003cp\u003eValidity Rates by Dataset and Stage\u003c/p\u003e \u003cp\u003eADP validates data items at each stage. In the character stage, data items pass validation if they contain only characters that have been marked as allowed. Data items pass validation at the word stage if they contain only words that have been marked as allowed. In the phrase stage, data items pass validation if they match to a pattern designated as valid. The word and phrase stages only attempt to normalize data items that have passed validation in the previous stage(s). 
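The staged validation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not ADP's actual code: the allowed-character set, allowed-word set, phrase-type patterns, and the function names `is_number` and `validate` are all hypothetical, and treating numeric tokens as automatically allowed words is an assumption made for brevity.

```python
import re

# Hypothetical reference data; in practice these would be built from
# the user's action decisions and phrase-type patterns.
ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789 .-")
ALLOWED_WORDS = {"week", "month", "year", "median", "to"}
# Each hypothetical phrase-type pattern is (regex, designated_valid).
PHRASE_PATTERNS = [
    (re.compile(r"\d+(\.\d+)? (week|month|year)"), True),  # value with unit
    (re.compile(r"\d+(\.\d+)?"), False),                   # value without unit
]


def is_number(token):
    """Assumption: numeric tokens are accepted at the word stage."""
    return re.fullmatch(r"\d+(\.\d+)?", token) is not None


def validate(item):
    """Return the first stage at which the item fails, or 'valid'."""
    # Character stage: only allowed characters may appear.
    if not set(item) <= ALLOWED_CHARS:
        return "character"
    # Word stage: only reached by items that passed the character stage.
    if not all(is_number(w) or w in ALLOWED_WORDS for w in item.split()):
        return "word"
    # Phrase stage: the item must match a pattern designated as valid.
    for pattern, designated_valid in PHRASE_PATTERNS:
        if pattern.fullmatch(item):
            return "valid" if designated_valid else "phrase"
    return "phrase"  # matched no pattern at all


for example in ["6 week", "7", "6 wk", "6 week!"]:
    print(example, "->", validate(example))
```

In this sketch, an item such as "7" fails at the phrase stage because it matches a pattern designated as invalid, whereas an item that matches no pattern fails at the same stage for a different reason.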
The Validation Results by Dataset and Stage figures below show the rates of validity achieved with the aforementioned non-exhaustive rule-setting approach.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn the character stage, validity rates for both datasets are above 99%. These character validation results were achieved following 21 action decisions for the age dataset and 39 action decisions for the data-location dataset in the character normalization stage (see Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn the word stage, validity rates for both datasets are above 98%. These word validation results were achieved following 94 action decisions for the age dataset and 187 action decisions\u003csup\u003e4\u003c/sup\u003e for the data-location dataset in the word normalization stage (see Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eThe age dataset\u0026rsquo;s validity rate at the phrase stage is significantly lower than that of the data-location dataset: 83.8% of data items pass phrase validation in the age dataset, while 97.9% of data items pass phrase validation in the data-location dataset. This lower rate results from a relatively large number of age data items that match invalid patterns. In particular, for the age dataset, numerical exact values (e.g., \u0026ldquo;7\u0026rdquo;) and ranges without units (e.g., \u0026ldquo;8\u0026ndash;10\u0026rdquo;) are designated as invalid phrase types because the age dataset contains ages expressed in hours, days, weeks, months, and years, so any number-unit combination is theoretically possible; without a unit, numerical age values are practically meaningless. 
As recorded in the phrase-normalized age dataset file, of the 1019 data items that failed phrase validation, only 105 (1.47% of all data items) failed because they did not match any pattern; all the rest failed because they matched a pattern designated as invalid.\u003c/p\u003e \u003cp\u003eThese phrase validation results were achieved by matching against 16 phrase-type patterns for the age dataset and 12 patterns for the data-location dataset (see Table\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e7\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIt is evident that a relatively low number of user action decisions is sufficient to produce very high rates of validity in at least these two free-text datasets. Notably, in both the character and word stages, reaching similar results (\u0026gt;\u0026thinsp;99% validity in the character stage and \u0026gt;\u0026thinsp;98% validity in the word stage) in the two datasets required only about twice as many action decisions in the data-location dataset as in the age dataset, even though the data-location dataset is more than 35 times longer than the age dataset.\u003c/p\u003e \u003cp\u003eExtent of Change to Data Items\u003c/p\u003e \u003cp\u003eIn the character and word stages, the values in the Levenshtein distance score columns (see Measuring String Change During Normalization above) serve as indicators of the extent to which strings are modified during the normalization process. Figures\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e and 5 show the frequency distributions of Levenshtein distance scores by dataset and stage. Note that the word stage figures for both datasets use a logarithmic scale for clarity.\u003c/p\u003e \u003cp\u003eFor the age dataset, Levenshtein distance score frequency graphs show that most data items receive little modification during the character and word stages. 
The notable spike at a score of 1 in the word stage results from the abundance of age data items with plural units that were normalized to singular; the score of 1 frequently represents the removal of an \u0026ldquo;s\u0026rdquo; from \u0026ldquo;years\u0026rdquo;, \u0026ldquo;months\u0026rdquo;, or \u0026ldquo;weeks.\u0026rdquo;\u003c/p\u003e \u003cp\u003eIn the data-location dataset, the uniform nature of much of the dataset (namely the \u0026gt;\u0026thinsp;200,000 lines of HLA Ligand Atlas URLs) produces other spikes in the character stage Levenshtein distance frequency chart. The spike at 9 is one such case. Of the 57,517 data-location data items with a Levenshtein distance score of 9 at the character stage, 91% (52,522) are HLA Ligand Atlas URLs whose paths end in a string of 9 uppercase letters (e.g., \u0026ldquo;\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://hla-ligand-atlas.org/peptide/AAAAAQSVY\u003c/span\u003e\u003cspan address=\"https://hla-ligand-atlas.org/peptide/AAAAAQSVY\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u0026rdquo;). The URLs resolve in the same way with lowercase and uppercase letters in that path; the example URL above is functionally equivalent to \u0026ldquo;\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://hla-ligand-atlas.org/peptide/aaaaaqsvy\u003c/span\u003e\u003cspan address=\"https://hla-ligand-atlas.org/peptide/aaaaaqsvy\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u0026rdquo;, so normalizing to lowercase does not result in any lost meaning. 
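The two spikes discussed above can be reproduced with a short Levenshtein distance sketch. This is a textbook dynamic-programming implementation written for illustration; it is not taken from the ADP repository, and the function name `levenshtein` is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


url = "https://hla-ligand-atlas.org/peptide/AAAAAQSVY"
# Lowercasing changes only the 9 uppercase letters in the path.
print(levenshtein(url, url.lower()))     # 9
# Singularizing a plural unit removes one character.
print(levenshtein("6 years", "6 year"))  # 1
```

Each of the nine uppercase letters requires one substitution, which is why URLs of this form all land on a score of exactly 9 at the character stage.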
Levenshtein distance scores at the word stage cluster strongly around 0 for the data-location dataset, a reflection of the fact that a large portion of the dataset, namely the URLs, received no word normalization.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eFigure 5: Levenshtein Distance Scores by Stage, Data-Location Dataset\u003c/b\u003e \u003c/p\u003e \u003cp\u003eLevenshtein distance ceases to be a useful metric at the phrase stage, at which it is often desirable to make significant changes to the overall structure of the data item. Straightforward and benign changes like alterations in word order produce high Levenshtein distances. Accordingly, Levenshtein distance scores are not tracked at the phrase stage.\u003c/p\u003e \u003cp\u003eData-Location Phrase Splitting and Phrase-Part Validity\u003c/p\u003e \u003cp\u003eBecause the data-location dataset included a high number of list-like inputs made up of several individual data locations, the data items in that dataset were put through a splitter script that divided list-like data items so that each output datum referenced exactly one data location (see Data-Location Splitting above).\u003c/p\u003e \u003cp\u003eFor this dataset, we calculate additional relevant metrics. Split phrase count (listed in the split_phrase_count column) refers to the total number of outputs split out of an original input data item; e.g., the input item \u0026ldquo;Table \u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u0026rdquo;, which is split into the data items \u0026ldquo;Table \u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u0026rdquo; and \u0026ldquo;figure \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u0026rdquo;, has a split phrase count of 2. Validity rate is the number of valid output data items divided by the split phrase count. 
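As a minimal sketch of these two metrics, assuming each input data item has already been split and each output has a pass/fail phrase-validation result (the function name `split_metrics` and the boolean-list representation are hypothetical, not ADP's own data model):

```python
from statistics import mean


def split_metrics(outputs_valid):
    """Given one boolean per split output (True = passed phrase
    validation), return (split_phrase_count, validity_rate)."""
    if not outputs_valid:
        raise ValueError("an input data item yields at least one output")
    split_phrase_count = len(outputs_valid)
    validity_rate = sum(outputs_valid) / split_phrase_count
    return split_phrase_count, validity_rate


# One entry per INPUT data item, so dataset-level means are not
# skewed by the extra rows that splitting creates.
per_input = [
    split_metrics([True, True]),               # two outputs, both valid
    split_metrics([True, True, True, False]),  # four outputs, one invalid
    split_metrics([True]),                     # unsplit and valid
]
print(per_input)
print(mean(count for count, _ in per_input))
print(mean(rate for _, rate in per_input))
```

Computing the means over `per_input` rather than over output rows mirrors the choice to record each metric exactly once per input data item.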
A validity rate of 1 means that every output data item that derives from a particular input data item is valid, while a validity rate of 0 means that none of those output data items are valid. Split phrase count and validity rate (along with all other analytics, like Levenshtein distance scores) are recorded in the phrase-normalized output file exactly once for each input data item so that means and frequency distributions of those metrics are not skewed by the row-count increase that occurs during phrase splitting.\u003c/p\u003e \u003cp\u003eAs is evident in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e6\u003c/span\u003e, the large number of HLA Ligand Atlas URLs in the dataset concentrates both the split phrase count and the validity rate around 1, as the URLs are all unsplit and valid. Including URLs, the mean split phrase count is 1.24 (standard deviation 0.92), and the mean phrase validity rate is 1.00 (standard deviation 0.06).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIt is noteworthy that the data items with high split phrase counts tend towards high validity rates. It appears that those data items tend to be simple and orderly lists, such as the data item \u0026ldquo;Figures \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e, Supplementary Figs.\u0026nbsp;2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13\u0026rdquo;, which has a split phrase count of 16 and a validity rate of 1.0. 
Such data items are simpler to split, and their split outputs are individually simpler and more readily matchable to basic phrase patterns than the less uniform lists that occur towards the middle of the split phrase count range, such as \u0026ldquo;Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e and Figs.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and Supporting Information S2 Figure\u0026rdquo; (split phrase count 4, validity rate 0.75).\u003c/p\u003e \u003cp\u003eWhen examining only non-URL data items, strong clustering around a validity rate of 1 remains, but with a more obvious spread of split phrase count values, as is evident in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e7\u003c/span\u003e. Excluding URLs, the mean split phrase count is 3.19 (standard deviation 1.91), and the mean phrase validity rate is 0.96 (standard deviation 0.17).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe implementation of data-location phrase splitting achieves very high rates of validity even among the complicated minority made up of non-URL data items.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eMeasuring Normalization Empirically\u003c/p\u003e \u003cp\u003eThe ADP toolset provides several metrics by which a user can measure the extent to which ADP normalization modifies the data, including Levenshtein distance scoring and validation pass/fail rates. These metrics are intended to approximate the degree to which the ADP normalization code improved the overall normality of the data without losing the original string\u0026rsquo;s meaning. However, empirically evaluating the success of the normalization process as a whole remains difficult due to the lack of a clear universal metric for dataset normalization. 
A useful future direction would be to establish an empirical way to measure degrees of standardization in unstructured datasets; ideally, such a metric would allow comparisons between free-text datasets\u0026rsquo; spelling accuracy, adherence to grammar, and stylistic consistency.\u003c/p\u003e \u003cp\u003eEvaluating the Utility of ADP\u003c/p\u003e \u003cp\u003eWhile the age and data-location datasets are distinct enough in size, content, and style to make a case for the flexibility of the ADP rule-setting framework for normalization, its use on these two datasets is not sufficient to demonstrate that ADP is a useful tool for a truly wide range of free-text datasets. Further experimentation with other free-text datasets will be necessary to ensure that ADP normalization is adaptable enough to be used with a diverse range of free-text datasets.\u003c/p\u003e \u003cp\u003eDeveloping frameworks for testing the accuracy of ADP\u0026rsquo;s outputs compared to other normalization methods is an active priority. ADP\u0026rsquo;s user-dependence is a design feature that was implemented specifically because we hypothesize that it will result in higher precision of normalization results compared to predictive normalization tools, which can struggle with certain context-specific normalization decisions, like handling instances of \u0026ldquo;cant\u0026rdquo; occurring as a synonym of \u0026ldquo;slang\u0026rdquo; rather than a misspelling of \u0026ldquo;can\u0026rsquo;t\u0026rdquo;, that humans can make quickly and accurately [5]. Future testing will likely include evaluating how effectively ADP normalization preserves the meaning of data items throughout the normalization process compared to analogous normalization tools.\u003c/p\u003e \u003cp\u003eSome recent tools for free-text standardization make use of large language models to perform standardization tasks. 
One such tool is CleanAgent, which uses an LLM agent to identify the types of data (e.g., phone number, email address, date) in each column of a CSV, write and run Python code to standardize each column\u0026rsquo;s data based on its type, and interact with the user throughout the standardization process [7]. CleanAgent\u0026rsquo;s \u0026ldquo;hands-off\u0026rdquo; approach to data standardization is intended to ensure ease of use and efficiency, as CleanAgent is designed to enable \u0026ldquo;data scientists to input their standardization requirements in one instance\u0026rdquo; [7]. At the time of the writing of this manuscript, the authors were unable to run CleanAgent on either the age or data-location datasets. We have reached out to the developers of CleanAgent about a recurring error message; we aim to do a direct comparison between the free-text normalization outcomes from CleanAgent and ADP in the future. The following analysis of CleanAgent\u0026rsquo;s utility is accordingly based solely on the contents of the CleanAgent demonstration publication [7].\u003c/p\u003e \u003cp\u003eCleanAgent\u0026rsquo;s column type identification processes appear to be somewhat limited. As shown in its 2024 demonstrational publication, CleanAgent\u0026rsquo;s column type identifier attributes the datatype \u0026ldquo;address\u0026rdquo; to the \u0026ldquo;Name\u0026rdquo; column of its test CSV, which stands out as a potential misidentification; the dataset used in that demonstration is not included with that paper nor does it appear to be listed in the GitHub repository for CleanAgent [8], so we were unable to verify if that is indeed a valid datatype for that column. 
Furthermore, as shown in that demonstration publication, CleanAgent did not identify a datatype for the columns named \u0026ldquo;AGE\u0026rdquo; and \u0026ldquo;weight__\u0026rdquo; [7], suggesting that CleanAgent may struggle to standardize columns of data that, unlike email addresses or phone numbers, lack a well-defined standard form. CleanAgent\u0026rsquo;s LLM-based normalization appears to be an efficient solution to the problem of standardizing fairly simple data items (e.g., dates or contact information), and it is designed to accomplish normalization with very little work on the user\u0026rsquo;s part. ADP, by comparison, requires more of the user\u0026rsquo;s time and effort but handles more complex and widely-varying data items.\u003c/p\u003e \u003cp\u003eLLM-based normalization tools are limited by the accuracy of the LLMs on which they depend. Output accuracy is a vitally important feature of standardization processes for biomedical datasets. Ostensibly, LLM accuracy will continue to improve, but in the meantime, there remains a need for high-precision standardization tools for use on datasets that require extremely high accuracy on free-text fields that are often far messier than comparatively simple \u0026ldquo;date\u0026rdquo; or \u0026ldquo;phone number\u0026rdquo; fields. ADP is designed for this niche. The authors intend to conduct further empirical evaluation of the accuracy of ADP and comparable LLM-based normalization tools.\u003c/p\u003e \u003cp\u003eProductive Value of Results\u003c/p\u003e \u003cp\u003eThese datasets contain all distinct age and data-location values recorded in the IEDB, but many of the individual values in these datasets represent thousands of instances of that value. As a result, the effect of normalizing these datasets is multiplied. 
The IEDB records more than 18\u0026nbsp;million total age values and more than 21\u0026nbsp;million total data-location values [4], so applying normalization to these data values in the IEDB will accordingly produce significant improvements to findability and usability of millions of lines of data.\u003c/p\u003e \u003cp\u003eImproving Data Findability\u003c/p\u003e \u003cp\u003eUsing the ADP normalization toolkit, we normalized the age and data-location free-text datasets from the IEDB, two datasets with very different content and normalization needs, in such a way that renders the data in these datasets searchable, findable, and ontologizable in a way that they simply were not before. Standardizing the text in these datasets will enable the forthcoming implementation of dedicated search tools for these datasets in the IEDB. For instance, IEDB users could query for data from experiments on mice less than 28 days old and receive results within that range including ages originally expressed with varying formats and units (e.g., \u0026lsquo;10\u0026ndash;20 days old\u0026rsquo;, \u0026lsquo;2 days\u0026rsquo;, \u0026lsquo;24 h\u0026rsquo;, \u0026lsquo;1\u0026ndash;3 wks\u0026rsquo;). Similarly, by ontologizing categorical age data items like \u0026lsquo;juvenile\u0026rsquo;, \u0026lsquo;calf\u0026rsquo;, \u0026lsquo;foal\u0026rsquo;, \u0026lsquo;piglet\u0026rsquo;, or \u0026lsquo;child\u0026rsquo;, we can enable searches for pre-adult life stages across species.\u003c/p\u003e \u003cp\u003eThe FAIR data principles identify searchability (principle F4) as a critical aspect of data findability, so improving the IEDB\u0026rsquo;s search functionalities is core to the IEDB\u0026rsquo;s effort to improve its overall data FAIRness [9]. The data-location dataset in particular was identified as a promising candidate for work to improve the IEDB\u0026rsquo;s FAIRness in a 2018 analysis of the IEDB\u0026rsquo;s adherence to the FAIR standards [3]. 
Accordingly, the normalization performed on the data-location dataset using ADP completes that long-standing goal and demonstrates the IEDB\u0026rsquo;s ongoing commitment to improving data FAIRness in immunology.\u003c/p\u003e \u003cp\u003eEnabling Ontologization of Free-Text Data\u003c/p\u003e \u003cp\u003eBy standardizing the characters, words, and phrase structures in free-text datasets, ADP makes it easier to ontologize those datasets. Several prior publications have illustrated the benefits of linkages between IEDB data and formal ontologies [2,3,10,11]. Already, many IEDB data fields are mapped to terms from a wide range of ontologies, such as the \u0026ldquo;Organism\u0026rdquo; field being mapped to NCBI Taxonomy [12] terms and the \u0026ldquo;Evidence Code\u0026rdquo; field being mapped to Evidence Ontology [13] terms [14]. By standardizing the terms in use in the age and data-location datasets, ADP normalization is an effective step towards ontologizing the data in these fields. In particular, promising next steps include the ontologization of units in the age dataset via the Unit Ontology [15] and document parts via the Information Artifact Ontology and Ontology for Biomedical Investigations [16]. 
Should ADP prove effective on other free-text datasets within and beyond the IEDB, it will make it possible to reap the benefits of ontologization from large amounts of previously underutilized biomedical data.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eWhile further testing is necessary to validate ADP normalization on other datasets, preliminary evaluations of its application to the age and data-location datasets suggest that ADP normalization can produce high rates of output validity in diverse free-text datasets following a relatively low number of user action decisions.\u003c/p\u003e \u003cp\u003eThe Immune Epitope Database (IEDB) has made significant efforts over the past several years to improve its adherence to FAIR data standards through improvements to findability and interoperability of its data. Creating linkages with formal ontologies is a pillar of the IEDB\u0026rsquo;s efforts to improve interoperability, but these efforts have been concentrated on standardized datasets. The ability to standardize free-text datasets would enable further FAIRness and more effective utilization of the vast quantities of free-text data in the IEDB. 
Our preliminary results are promising indications that ADP normalization can standardize free-text datasets efficiently and accurately.\u003c/p\u003e"},{"header":"Declarations","content":" \u003cp\u003e \u003cstrong\u003eEthics Approval and Consent to Participate\u003c/strong\u003e \u003cp\u003eNot applicable.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eConsent for Publication\u003c/strong\u003e \u003cp\u003eNot applicable.\u003c/p\u003e \u003c/p\u003e\u003cp\u003e \u003ch2\u003eCompeting Interests\u003c/h2\u003e \u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003eResearch reported in this publication was supported by the National Institutes of Health contract 75N93019C00001 and grant U24CA248138.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eAll authors contributed to the conception of this project. S.D. and J.B. designed and developed the ADP software, collected resulting data, and drafted this manuscript. B.P. and J.A.O. advised on the software design and data collection process. R.V. advised on datasets to target for normalization, assisted in collection of input data, and provided substantial feedback on the software design. All authors read and approved the final manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgments\u003c/h2\u003e \u003cp\u003eWe wish to acknowledge the entire IEDB and CEDAR development and curation team.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eAll code and data discussed in this manuscript is available in the following GitHub repository: https://github.com/sebastianduesing/adp\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eGandomi A, Haider M. Beyond the hype: Big data concepts, methods, and analytics. Int J Inf Manag. 
2015 Apr 1;35(2):137\u0026ndash;44.\u003c/li\u003e\n\u003cli\u003eVita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, et al. The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 2019 Jan 8;47(Database issue):D339\u0026ndash;43.\u003c/li\u003e\n\u003cli\u003eVita R, Overton JA, Mungall CJ, Sette A, Peters B. FAIR principles and the IEDB: short-term improvements and a long-term vision of OBO-foundry mediated machine-actionable interoperability. Database. 2018 Jan 1;2018:bax105.\u003c/li\u003e\n\u003cli\u003eDuesing S. sebastianduesing/adp [Internet]. 2024 [cited 2024 Jul 1]. Available from: https://github.com/sebastianduesing/adp\u003c/li\u003e\n\u003cli\u003eClark E, Araki K. Text Normalization in Social Media: Progress, Problems and Applications for a Pre-Processing System of Casual English. Procedia - Soc Behav Sci. 2011 Jan 1;27:2\u0026ndash;11.\u003c/li\u003e\n\u003cli\u003eSproat R, Black AW, Chen S, Kumar S, Ostendorf M, Richards CD. Normalization of non-standard words. Comput Speech Lang. 2001 Jul 1;15(3):287\u0026ndash;333.\u003c/li\u003e\n\u003cli\u003eQi D, Wang J. CleanAgent: Automating Data Standardization with LLM-based Agents [Internet]. arXiv; 2024 [cited 2024 Sep 30]. Available from: http://arxiv.org/abs/2403.08291\u003c/li\u003e\n\u003cli\u003esfu-db/CleanAgent: This is an experimental demo repository of agent on data cleaning task [Internet]. [cited 2024 Oct 1]. Available from: https://github.com/sfu-db/CleanAgent\u003c/li\u003e\n\u003cli\u003eGO FAIR [Internet]. [cited 2024 Jun 7]. F4: (Meta)data are registered or indexed in a searchable resource. Available from: https://www.go-fair.org/fair-principles/f4-metadata-registered-indexed-searchable-resource/\u003c/li\u003e\n\u003cli\u003eVita R, Overton JA, Greenbaum JA, Ponomarenko J, Clark JD, Cantrell JR, et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 
2015 Jan 28;43(Database issue):D405\u0026ndash;12.\u003c/li\u003e\n\u003cli\u003eVita R, Overton JA, Sette A, Peters B. Better living through ontologies at the Immune Epitope Database. Database J Biol Databases Curation. 2017 Mar 18;2017:bax014.\u003c/li\u003e\n\u003cli\u003eDatabase Resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2017 Jan 4;45(Database issue):D12\u0026ndash;7.\u003c/li\u003e\n\u003cli\u003eChibucos MC, Mungall CJ, Balakrishnan R, Christie KR, Huntley RP, White O, et al. Standardized description of scientific evidence using the Evidence Ontology (ECO). Database J Biol Databases Curation. 2014;2014:bau075.\u003c/li\u003e\n\u003cli\u003eVita R, Overton JA, Peters B. Identification of errors in the IEDB using ontologies. Database J Biol Databases Curation. 2018 Feb 22;2018:bay005.\u003c/li\u003e\n\u003cli\u003eGkoutos GV, Schofield PN, Hoehndorf R. The Units Ontology: a tool for integrating units of measurement in science. Database J Biol Databases Curation. 2012 Oct 5;2012:bas033.\u003c/li\u003e\n\u003cli\u003eBandrowski A, Brinkman R, Brochhausen M, Brush MH, Bug B, Chibucos MC, et al. The Ontology for Biomedical Investigations. PLoS ONE. 2016 Apr 29;11(4):e0154556. 
\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Footnotes","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003e The reference files can be found at the following paths in the repository:\u003c/span\u003e\u003cdiv id=\"Par26\" class=\"Para\"\u003eage/output_files/char_reference.tsv\u003c/div\u003e\u003cdiv id=\"Par27\" class=\"Para\"\u003eage/output_files/word_reference.tsv\u003c/div\u003e\u003cdiv id=\"Par28\" class=\"Para\"\u003edata_loc/output_files/char_reference.tsv\u003c/div\u003e\u003cdiv id=\"Par29\" class=\"Para\"\u003edata_loc/output_files/word_reference.tsv\u003c/div\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e The phrase type files can be found at the following paths in the repository:\u003c/span\u003e\u003cdiv id=\"Par43\" class=\"Para\"\u003eage/input_files/age_phrase_types.tsv\u003c/div\u003e\u003cdiv id=\"Par44\" class=\"Para\"\u003edata_loc/input_files/data_loc_phrase_types.tsv\u003c/div\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e The data-location dataset contained a large number of Protein Data Bank (PDB) identifiers that parsed as distinct words. These IDs, which follow a standard four-character alphanumeric format, were selected using a regular expression and then mass-allowed. 
There are 186 non-PDB-ID words in the data-location reference file.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e 186 row-by-row action decisions plus one mass-allow of Protein Data Bank (PDB) identifiers, as described in footnote 1, performed via regular expression selection of the rows containing PDB identifiers.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"journal-of-biomedical-semantics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"jbsm","sideBox":"Learn more about [Journal of Biomedical Semantics](http://jbiomedsem.biomedcentral.com/)","snPcode":"13326","submissionUrl":"https://submission.nature.com/new-submission/13326/3","title":"Journal of Biomedical Semantics","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Unstructured data, free-text data, data normalization, data standardization, Immune Epitope Database, ontology.","lastPublishedDoi":"10.21203/rs.3.rs-5363542/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5363542/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eWhile unstructured data, such as free text, constitutes a large amount of publicly available biomedical data, it is underutilized in automated analyses due to the difficulty of extracting 
meaning from it. Normalizing free-text data, \u003cem\u003ei.e.\u003c/em\u003e, removing inessential variance, enables the use of structured vocabularies like ontologies to represent the data and allow for harmonized queries over it. This paper presents an adaptable tool for free-text normalization and an evaluation of the application of this tool to two different sets of unstructured biomedical data curated from the literature in the Immune Epitope Database (IEDB): age and data-location.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eFree text entries for the database fields for subject age (4095 distinct values) and publication data-location (251,810 distinct values) in the IEDB were analyzed. Normalization was performed in three steps, namely character normalization, word normalization, and phrase normalization, using generalizable rules developed and applied with the tool presented in this manuscript. For the age dataset, in the character stage, the application of 21 rules resulted in 99.97% output validity; in the word stage, the application of 94 rules resulted in 98.06% output validity; and in the phrase stage, the application of 16 rules resulted in 83.81% output validity. For the data-location dataset, in the character stage, the application of 39 rules resulted in 99.99% output validity; in the word stage, the application of 187 rules resulted in 98.46% output validity; and in the phrase stage, the application of 12 rules resulted in 97.95% output validity.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eWe developed a generalizable approach for normalization of free text as found in database fields with content on a specific topic. Creating and testing the rules took a one-time effort for a given field that can now be applied to data as it is being curated. 
The standardization achieved in two datasets tested produces significantly reduced variance in the content which enhances the findability and usability of that data, chiefly by improving search functionality and enabling linkages with formal ontologies.\u003c/p\u003e","manuscriptTitle":"Standardizing Free-Text Data Exemplified by Age and Data-Location Fields in the Immune Epitope Database","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-11-14 11:26:35","doi":"10.21203/rs.3.rs-5363542/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-12-11T20:33:37+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-12-11T19:34:11+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-11-27T12:55:43+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-11-26T21:57:50+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"134094887854789762335134073578261829990","date":"2024-11-07T02:25:57+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"127896086113637218042817871986239353220","date":"2024-11-06T23:42:29+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"55044968070030563232438301056425783957","date":"2024-11-06T21:53:13+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"212643617591510795466706979686170545481","date":"2024-11-06T20:33:32+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-11-06T20:30:07+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-11-04T06:42:46+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-11-04T00:05:06+00:00","index":"","fulltext":""},{"type":"submitted","content":"Journal of Biomedical 
Semantics","date":"2024-10-30T21:35:34+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"journal-of-biomedical-semantics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"jbsm","sideBox":"Learn more about [Journal of Biomedical Semantics](http://jbiomedsem.biomedcentral.com/)","snPcode":"13326","submissionUrl":"https://submission.nature.com/new-submission/13326/3","title":"Journal of Biomedical Semantics","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"157250f1-bd12-4482-93fb-47b0f8cf8dcd","owner":[],"postedDate":"November 14th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-03-24T15:58:51+00:00","versionOfRecord":{"articleIdentity":"rs-5363542","link":"https://doi.org/10.1186/s13326-025-00324-7","journal":{"identity":"journal-of-biomedical-semantics","isVorOnly":false,"title":"Journal of Biomedical Semantics"},"publishedOn":"2025-03-22 15:56:57","publishedOnDateReadable":"March 22nd, 2025"},"versionCreatedAt":"2024-11-14 11:26:35","video":"","vorDoi":"10.1186/s13326-025-00324-7","vorDoiUrl":"https://doi.org/10.1186/s13326-025-00324-7","workflowStages":[]},"version":"v1","identity":"rs-5363542","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5363542","identity":"rs-5363542","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
