Ten species comprise half of the bacteriology literature, leaving most species unstudied | bioRxiv

New Results

Paul A. Jensen
doi: https://doi.org/10.1101/2025.01.04.631297

1 Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI, USA
2 Department of Chemical Engineering, University of Michigan, Ann Arbor, MI, USA
For correspondence: pjens{at}umich.edu

Abstract

Microbiology research has historically focused on a few species of model organisms. Our bibliographic analysis finds extreme bias in the distribution of bacteriology research across species, with half of all papers referencing only ten species and 74% of all known species remaining unstudied. Microbiologists will need to broaden their perspective and embrace complexity to develop a complete understanding of the microbial world.

The microbiome revolution moved the goalposts for microbiologists. For centuries, microbiology has focused intensely on understanding a relatively small number of microbes. These model species were selected for their importance to health, the environment, industry, or simply because the species were easy to work with. Microbiologists maintained their focus throughout the molecular, genetic, and genomic revolutions, but the metagenomic revolution made it impossible to ignore the thousands of understudied species found in every facet of our world (Dewhirst et al. 2010; Quast et al. 2013; Parks et al. 2018). The scientific rise of the microbiome is exciting, but it presents an enormous practical challenge for microbiology. If it took centuries to learn the details of only a few model species, how can we ever understand the thousands of newfound species?

To illustrate the paucity of data on understudied microbes, we performed a bibliometric analysis to quantify the uneven distribution of microbiology research. Release 202 of the GTDB database (Parks et al.
2022) includes 43,409 unique species, and we counted the number of PubMed articles that refer to each species in their title or abstract. The results were heavily skewed. Almost 74% of all known species have never been the subject of a scientific publication; these are unstudied bacteria (Figure 1a). Even among the species studied (those with at least one publication), 50% of all articles refer to only ten species (Figure 1b). More than 90% of all bacteriology articles study fewer than 1% of the species, creating a “long tail” of understudied microbes.

Figure 1: A. Most species of bacteria have never been the subject of a scientific paper. We counted how many papers in the PubMed database (up to November 1, 2024) reference each of 43,409 species of bacteria in their title or abstract. Nearly 74% of species have never been studied (gray dots). The most-studied 1% of species are the subject of 91.5% of papers, with 10 species appearing in 51.0% of all papers. B. Publications are unevenly distributed across the 50 most studied bacteria. The blue bar represents the relative fraction of papers published on each species. The number of articles in PubMed and the percentage of all bacteriology papers published appear to the left. C. The number of bacteriology papers published each year (black) grows more slowly than the cumulative number of species that have been the subject of a paper in PubMed (blue). Note the logarithmic scaling for both papers and species. While the number of papers is increasing, the number of papers per species has decreased since 1990, as revealed by panel D. It is important to remember that the number of papers published on each species remains highly skewed, and that the plots in C and D exclude the 74% of species that have never been the subject of a scientific paper.
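
As a minimal sketch of how summary figures like those in the caption could be computed from a table of per-species paper counts, the Python below ranks hypothetical counts and reports the unstudied fraction and the share held by the most-studied species. The function name, the toy counts, and the choice to take shares over species mentions (so a paper that names two species is counted twice) are assumptions for illustration. The analysis itself was performed in R and may differ in these details.

```python
from typing import Dict, List


def concentration_summary(
    counts: List[int], top_n: int = 10, top_frac: float = 0.01
) -> Dict[str, float]:
    """Summarize how unevenly papers are spread across species.

    `counts` holds one hypothetical paper count per species. Shares are
    taken over species mentions, so a paper naming two species counts twice.
    """
    if not counts:
        raise ValueError("counts must not be empty")

    ranked = sorted(counts, reverse=True)  # most-studied species first
    total = sum(ranked)
    n_species = len(ranked)
    unstudied = sum(1 for c in ranked if c == 0)

    # Number of species in the most-studied `top_frac`, at least one.
    k = max(1, int(n_species * top_frac))

    return {
        "fraction_unstudied": unstudied / n_species,
        "top_n_share": sum(ranked[:top_n]) / total if total else 0.0,
        "top_frac_share": sum(ranked[:k]) / total if total else 0.0,
    }


# Toy data (not the paper's data): 1,000 species, 740 of them with no papers.
toy_counts = [0] * 740 + [1] * 200 + [10] * 50 + [1000] * 10
summary = concentration_summary(toy_counts)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Sorting once and slicing the ranked list keeps the top-ten share and the top-1% share consistent with each other, which matters when the two cutoffs are compared as they are in panels A and B.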

The scientific enterprise is expanding, and every year scientists publish 4–5% more papers than the previous year (National Science Foundation and National Science Board 2021). It is tempting to think that the increase in scientific output will overcome the long tail of microbes, that is, scientists will eventually get around to studying every species. Unfortunately, the number of species discovered each year outpaces the increases in scientific output (Figure 1c). Between 1990 and 2020, the number of papers published per studied species of bacteria decreased by 60% (Figure 1d). Thus our knowledge density, the amount we learn per species, is actually decreasing.

Our view of bacterial diversity is biased when so much of our understanding comes from so few microbes. Microbiologist Jeffrey Gralnick once quipped that “E. coli is a great model organism—for E. coli.” Gralnick’s comment referenced the discovery of anomalies (relative to E. coli) in the TCA cycle of Shewanella oneidensis (Brutinel and Gralnick 2012). Although S. oneidensis has 201-fold fewer citations than E. coli, it is arguably not an understudied species. Our analysis ranks it as the 94th most studied bacterium, which is in the top 2.17% of all species. Even the introduction to Gralnick’s aforementioned paper refers to S. oneidensis as a “model environmental organism”. If differences like S. oneidensis’ TCA cycle can be found just outside the microbial 2%, imagine the diversity that lies in the other 98% of microbes.

How can microbiologists catch up to the exploding tree of life? We propose two grand challenges for training a generation of microbiologists who can tackle the diversity of the microbial world. First, we need to embrace multifactorial experiment design. There are far too many species, strains, genes, environments, stressors, and phenotypes to study one at a time.
Statisticians have taught for decades that the most efficient and robust experimental designs vary multiple factors simultaneously and then deconvolve the effects and interactions with simple statistical models (Fisher 1935). Despite the sound theoretical basis for multifactorial experiments, biologists are routinely taught that “good” experiments vary only a single factor. Teaching multifactorial design would improve the efficiency of microbiologists and illuminate many of the interactions between genes, environments, and cells.

Our second suggestion is to focus on the production of knowledge, not the collection of data. Microbiology is awash in big data, but knowledge production remains bottlenecked by the limited supply of human microbiologists. Statistical and computational tools help distill data for humans, but techniques based on pattern recognition emphasize commonalities between microbes rather than the unique features of each species. The databases themselves present a skewed view of microbial diversity. For example, more than half of the 791 transcriptomics experiments in the BV-BRC database come from five species (Olson et al. 2023). Even if we developed tools to convert all these data into biological knowledge, that knowledge would illuminate a tiny slice of the microbial world. Attempts to repurpose data for new species depend, ironically, on how similar that new species is to our model bacteria. Instead, our knowledge of truly understudied species progresses slowly as scientists publish individual papers using bespoke experiments from their own laboratories. Automating microbiology with robotics and artificial intelligence will accelerate our field (King et al. 2009; Dama et al. 2023), but we need to apply these tools to the myriad species that live in the understudied corners of our world.

Finally, we note that our analysis of microbial diversity includes only a small slice of bacteriology.
We did not analyze the literature for viruses, archaea, fungi, or other microbes. Thousands of papers study microbial communities or the microbiome as a whole, but our bibliographic searches only identified papers that named species in their title or abstract. Other papers, such as metagenomic analyses of complex communities, may associate several species with diseases or ecological niches, but our analysis does not capture the species named in the main text, tables, or supplements of these papers. Our study thus reinforces the knowledge gap between microbes that are only studied en masse as communities and those select few species whose molecular, genetic, or physiological diversity is studied in detail.

Methods

Our bibliographic searches used the NCBI Entrez Direct E-utilities software (Sayers et al. 2024). We searched for each species’ full name (“Staphylococcus aureus”) and abbreviated name (“S. aureus”). Any subspecies or strain identifiers in the GTDB database were removed before searching, and duplicate species names were removed. Some species share abbreviated names, e.g. Staphylococcus aureus and Streptomyces aureus; in these cases, we performed separate searches using the full and abbreviated names and kept only the full name results for the species with fewer full name results. Full name searches were also used for species with abbreviated names that form common words, such as Aminobacterium mobile = A. mobile. These cases were manually identified by their aberrantly high ratio of abbreviated name results to full name results. All analyses were performed in the R programming language (R Core Team). Visualizations were created with pgfplots (Feuersaenger 2020).

Data and Availability

All code and data used in our analysis are available on our lab website (http://jensenlab.net/publications).

Funding

This work was supported by the National Institutes of Health (grant GM138210).

References

Brutinel, Evan D. and Jeffrey A. Gralnick (Oct. 2012).
“Anomalies of the anaerobic tricarboxylic acid cycle in Shewanella oneidensis revealed by Tn-seq”. In: Molecular Microbiology 86.2, pp. 273–283. issn: 0950-382X, 1365-2958. doi: 10.1111/j.1365-2958.2012.08196.x. url: https://onlinelibrary.wiley.com/doi/10.1111/j.1365-2958.2012.08196.x (visited on 06/04/2024).

Dama, Adam C. et al. (May 2023). “BacterAI maps microbial metabolism without prior knowledge”. In: Nature Microbiology. issn: 2058-5276. doi: 10.1038/s41564-023-01376-0.

Dewhirst, Floyd E. et al. (Oct. 2010). “The human oral microbiome”. In: Journal of Bacteriology 192.19, pp. 5002–5017. issn: 1098-5530. doi: 10.1128/JB.00542-10.

Feuersaenger, Christian (2020). pgfplots. url: https://github.com/pgf-tikz/pgfplots.

Fisher, Ronald (1935). The Design of Experiments.

King, Ross D. et al. (Apr. 2009). “The automation of science”. In: Science (New York, N.Y.) 324.5923, pp. 85–89. issn: 1095-9203. doi: 10.1126/science.1165620.

National Science Foundation and National Science Board (2021). Publications Output: U.S. Trends and International Comparisons. Tech. rep. url: https://ncses.nsf.gov/pubs/nsb20206/executive-summary (visited on 08/29/2021).

Olson, Robert D. et al. (Jan. 2023). “Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): a resource combining PATRIC, IRD and ViPR”. In: Nucleic Acids Research 51.D1, pp. D678–D689. issn: 1362-4962. doi: 10.1093/nar/gkac1003.

Parks, Donovan H. et al. (Nov. 2018). “A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life”. In: Nature Biotechnology 36.10, pp. 996–1004. issn: 1546-1696. doi: 10.1038/nbt.4229. url: https://www.nature.com/articles/nbt.4229 (visited on 08/29/2021).

Parks, Donovan H. et al. (Jan. 2022). “GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy”. In: Nucleic Acids Research 50.D1, pp. D785–D794. issn: 0305-1048, 1362-4962. doi: 10.1093/nar/gkab776. url: https://academic.oup.com/nar/article/50/D1/D785/6370255 (visited on 06/04/2024).

Quast, Christian et al. (Jan. 2013). “The SILVA ribosomal RNA gene database project: improved data processing and web-based tools”. In: Nucleic Acids Research 41.Database issue, pp. D590–D596. issn: 0305-1048. doi: 10.1093/nar/gks1219. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531112/ (visited on 08/29/2021).

Sayers, Eric W. et al. (Jan. 2024). “Database resources of the National Center for Biotechnology Information”. In: Nucleic Acids Research 52.D1, pp. D33–D43. issn: 1362-4962. doi: 10.1093/nar/gkad1044.

Posted January 04, 2025.
Subject Area: Microbiology