Research Overview

Dr. Guergana Savova's research interests are in natural language processing (NLP), especially as applied to the text generated by physicians (the clinical narrative), a field usually referred to as clinical NLP. She has been creating gold-standard annotated resources based on computable definitions and developing methods for computable solutions. Her research focuses on higher-level semantic and discourse processing of the clinical narrative, which includes tasks such as named entity recognition, event recognition, and relation detection and classification, including coreference and temporal relations. The methods are mostly machine learning, spanning supervised, lightly supervised and completely unsupervised approaches.

Dr. Savova's research with her collaborators has led to the creation of the clinical Text Analysis and Knowledge Extraction System (cTAKES; http://sourceforge.net/projects/ohnlp/files/cTAKES/), which has been released as an open source application under an Apache license. cTAKES is an information extraction system comprising a number of NLP components. It has been applied to mine the data within the clinical narrative for a number of biomedical use cases, in projects such as i2b2, PGRN and eMERGE. Within Informatics for Integrating Biology and the Bedside (i2b2), cTAKES has been used to extract patient characteristics for determining patients' status with respect to a specific phenotype (Multiple Sclerosis, Inflammatory Bowel Disease, Type 2 Diabetes). Within the Pharmacogenomics Research Network (PGRN), cTAKES has been applied to automatically determine patients' disease activity and to detect responders versus non-responders to a specific treatment. Within the Electronic Medical Records and Genomics (eMERGE) network, cTAKES has been applied to automatically discover patients with Peripheral Arterial Disease.

Dr. Savova's NLP collaborators include Profs. Martha Palmer, James Martin and Wayne Ward from the University of Colorado, Prof. Wendy Chapman from the University of California at San Diego, Prof. Noemie Elhadad from Columbia University, Drs. Lynette Hirschman, Cheryl Clark and John Aberdeen from the MITRE Corporation, Prof. James Pustejovsky from Brandeis University, and Prof. Rebecca Crowley from the University of Pittsburgh. Dr. Savova is the recipient of NIH funding for multiple projects, which are listed separately on this website.

Research Background

Dr. Guergana Savova is a reviewer for the Journal of the American Medical Informatics Association (JAMIA), the Journal of Biomedical Informatics (JBI) and many conferences and workshops. She is also a member of the National Library of Medicine's Biomedical Library and Informatics Review Committee.

Dr. Guergana Savova holds a PhD in Linguistics with a minor in Cognitive Science and a Master of Science in Computer Science from the University of Minnesota. Before joining CHIP and HMS in 2010, Dr. Savova was a member of the Biomedical Statistics and Informatics Department faculty at the Mayo Clinic (2002-2010).

Publications

  1. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med. 2025 Jan; 31(1):60-69.
  2. A New Era of Data-Driven Cancer Research and Care: Opportunities and Challenges. Cancer Discov. 2024 Oct 04; 14(10):1774-1778.
  3. The TRIPOD-LLM Statement: A Targeted Guideline For Reporting Large Language Models Use. medRxiv. 2024 Jul 25.
  4. Family history as the strongest predictor of aortic and peripheral aneurysms in patients with intracranial aneurysms. J Clin Neurosci. 2024 Aug; 126:128-134.
  5. The effect of using a large language model to respond to patient messages. Lancet Digit Health. 2024 Jun; 6(6):e379-e381.
  6. Evaluating the ChatGPT family of models for biomedical reasoning and classification. J Am Med Inform Assoc. 2024 04 03; 31(4):940-948.
  7. Considerations for Prompting Large Language Models-Reply. JAMA Oncol. 2024 Apr 01; 10(4):538-539.
  8. Large language models to identify social determinants of health in electronic health records. NPJ Digit Med. 2024 Jan 11; 7(1):6.
  9. Improving model transferability for clinical note section classification models using continued pretraining. J Am Med Inform Assoc. 2023 12 22; 31(1):89-97.
  10. DeepPhe-CR: Natural Language Processing Software Services for Cancer Registrar Case Abstraction. medRxiv. 2023 Oct 26.
  11. Use of Artificial Intelligence Chatbots for Cancer Treatment Information. JAMA Oncol. 2023 10 01; 9(10):1459-1462.
  12. DeepPhe-CR: Natural Language Processing Software Services for Cancer Registrar Case Abstraction. JCO Clin Cancer Inform. 2023 09; 7:e2300156.
  13. End-to-end clinical temporal information extraction with multi-head attention. Proc Conf Assoc Comput Linguist Meet. 2023 Jul; 2023:313-319.
  14. Natural Language Processing to Automatically Extract the Presence and Severity of Esophagitis in Notes of Patients Undergoing Radiotherapy. JCO Clin Cancer Inform. 2023 07; 7:e2300048.
  15. Natural Language Processing Methods to Empirically Explore Social Contexts and Needs in Cancer Patient Notes. JCO Clin Cancer Inform. 2023 05; 7:e2200196.
  16. Improving Model Transferability for Clinical Note Section Classification Models Using Continued Pretraining. medRxiv. 2023 Apr 24.
  17. An End-to-End Natural Language Processing System for Automatically Extracting Radiation Therapy Events From Clinical Texts. Int J Radiat Oncol Biol Phys. 2023 09 01; 117(1):262-273.
  18. Geometric Features Associated with Middle Cerebral Artery Bifurcation Aneurysm Formation: A Matched Case-Control Study. J Stroke Cerebrovasc Dis. 2022 Mar; 31(3):106268.
  19. Open-source Software Sustainability Models: Initial White Paper From the Informatics Technology for Cancer Research Sustainability and Industry Partnership Working Group. J Med Internet Res. 2021 12 02; 23(12):e20028.
  20. Tobacco use and age are associated with different morphologic features of anterior communicating artery aneurysms. Sci Rep. 2021 02 26; 11(1):4791.
  21. Clinical Natural Language Processing for Radiation Oncology: A Review and Practical Primer. Int J Radiat Oncol Biol Phys. 2021 Jul 01; 110(3):641-655.
  22. Morphological variables associated with ruptured basilar tip aneurysms. Sci Rep. 2021 01 28; 11(1):2526.
  23. Geometric variations associated with posterior communicating artery aneurysms. J Neurointerv Surg. 2021 Nov; 13(11):1049-1052.
  24. Vascular Geometry Associated with Anterior Communicating Artery Aneurysm Formation. World Neurosurg. 2021 02; 146:e1318-e1325.
  25. Surrounding vascular geometry associated with basilar tip aneurysm formation. Sci Rep. 2020 10 21; 10(1):17928.
  26. Adverse drug event presentation and tracking (ADEPT): semiautomated, high throughput pharmacovigilance using real-world data. JAMIA Open. 2020 Oct; 3(3):413-421.
  27. Age and morphology of posterior communicating artery aneurysms. Sci Rep. 2020 07 14; 10(1):11545.
  28. Mining Misdiagnosis Patterns from Biomedical Literature. AMIA Jt Summits Transl Sci Proc. 2020; 2020:360-366.
  29. Interactive Exploration of Longitudinal Cancer Patient Histories Extracted From Clinical Text. JCO Clin Cancer Inform. 2020 05; 4:412-420.
  30. Does BERT need domain adaptation for clinical negation detection? J Am Med Inform Assoc. 2020 04 01; 27(4):584-591.
  31. Adverse drug event rates in pediatric pulmonary hypertension: a comparison of real-world data sources. J Am Med Inform Assoc. 2020 02 01; 27(2):294-300.
  32. High-throughput phenotyping with electronic medical record data using a common semi-supervised approach (PheCAP). Nat Protoc. 2019 12; 14(12):3426-3444.
  33. Use of Narrative Concepts in Electronic Health Records to Validate Associations Between Genetic Factors and Response to Treatment of Inflammatory Bowel Diseases. Clin Gastroenterol Hepatol. 2020 07; 18(8):1890-1892.
  34. Use of Natural Language Processing to Extract Clinical Cancer Phenotypes from Electronic Medical Records. Cancer Res. 2019 11 01; 79(21):5463-5470.
  35. Morphological Variables Associated With Ruptured Middle Cerebral Artery Aneurysms. Neurosurgery. 2019 07 01; 85(1):75-83.
  36. Supervised methods to extract clinical events from cardiology reports in Italian. J Biomed Inform. 2019 07; 95:103219.
  37. Decreased Total Iron Binding Capacity May Correlate with Ruptured Intracranial Aneurysms. Sci Rep. 2019 04 15; 9(1):6054.
  38. Potential Impact of Initial Clinical Data on Adjustment of Pediatric Readmission Rates. Acad Pediatr. 2019 07; 19(5):589-598.
  39. Elevated International Normalized Ratio Is Associated With Ruptured Aneurysms. Stroke. 2018 09; 49(9):2046-2052.
  40. Association between aspirin dose and subarachnoid hemorrhage from saccular aneurysms: A case-control study. Neurology. 2018 09 18; 91(12):e1175-e1181.
  41. Low Serum Calcium and Magnesium Levels and Rupture of Intracranial Aneurysms. Stroke. 2018 07; 49(7):1747-1750.
  42. Lipid-Lowering Agents and High HDL (High-Density Lipoprotein) Are Inversely Associated With Intracranial Aneurysm Rupture. Stroke. 2018 05; 49(5):1148-1154.
  43. Clinical Natural Language Processing in languages other than English: opportunities and challenges. J Biomed Semantics. 2018 03 30; 9(1):12.
  44. Antihyperglycemic Agents Are Inversely Associated With Intracranial Aneurysm Rupture. Stroke. 2018 01; 49(1):34-39.
  45. Heroin Use Is Associated with Ruptured Saccular Aneurysms. Transl Stroke Res. 2018 08; 9(4):340-346.
  46. DeepPhe: A Natural Language Processing System for Extracting Cancer Phenotypes from Clinical Records. Cancer Res. 2017 11 01; 77(21):e115-e118.
  47. Capturing the Patient's Perspective: a Review of Advances in Natural Language Processing of Health-Related Text. Yearb Med Inform. 2017 Aug; 26(1):214-227.
  48. Phelan-McDermid syndrome data network: Integrating patient reported outcomes with clinical notes and curated genetic reports. Am J Med Genet B Neuropsychiatr Genet. 2018 10; 177(7):613-624.
  49. Association of intracranial aneurysm rupture with smoking duration, intensity, and cessation. Neurology. 2017 Sep 26; 89(13):1408-1415.
  50. Alcohol Consumption and Aneurysmal Subarachnoid Hemorrhage. Transl Stroke Res. 2018 02; 9(1):13-19.
  51. Towards generalizable entity-centric clinical coreference resolution. J Biomed Inform. 2017 05; 69:251-258.
  52. Large-scale identification of patients with cerebral aneurysms using natural language processing. Neurology. 2017 Jan 10; 88(2):164-168.
  53. An information model for computable cancer phenotypes. BMC Med Inform Decis Mak. 2016 09 15; 16(1):121.
  54. Suboptimal Clinical Documentation in Young Children with Severe Obesity at Tertiary Care Centers. Int J Pediatr. 2016; 2016:4068582.
  55. Electronic Health Record Based Algorithm to Identify Patients with Autism Spectrum Disorder. PLoS One. 2016; 11(7):e0159621.
  56. Developing an Algorithm to Detect Early Childhood Obesity in Two Tertiary Pediatric Medical Centers. Appl Clin Inform. 2016 07 20; 7(3):693-706.
  57. Normalizing acronyms and abbreviations to aid patient understanding of clinical texts: ShARe/CLEF eHealth Challenge 2013, Task 2. J Biomed Semantics. 2016 Jul 01; 7:43.
  58. Comparative Effectiveness of Infliximab and Adalimumab in Crohn's Disease and Ulcerative Colitis. Inflamm Bowel Dis. 2016 Apr; 22(4):880-5.
  59. PheKB: a catalog and workflow for creating electronic phenotype algorithms for transportability. J Am Med Inform Assoc. 2016 11; 23(6):1046-1052.
  60. Identification of Nonresponse to Treatment Using Narrative Data in an Electronic Health Record Inflammatory Bowel Disease Cohort. Inflamm Bowel Dis. 2016 Jan; 22(1):151-8.
  61. Semi-supervised Learning for Phenotyping Tasks. AMIA Annu Symp Proc. 2015; 2015:502-11.
  62. Multilayered temporal modeling for the clinical domain. J Am Med Inform Assoc. 2016 Mar; 23(2):387-95.
  63. Identification of subjects with polycystic ovary syndrome using electronic health records. Reprod Biol Endocrinol. 2015 Oct 29; 13:116.
  64. Methods to Develop an Electronic Medical Record Phenotype Algorithm to Compare the Risk of Coronary Artery Disease across 3 Chronic Disease Cohorts. PLoS One. 2015; 10(8):e0136651.
  65. An Introduction to Natural Language Processing: How You Can Get More From Those Electronic Notes You Are Generating. Pediatr Emerg Care. 2015 Jul; 31(7):536-41.
  66. Development of phenotype algorithms using electronic medical records and incorporating natural language processing. BMJ. 2015 Apr 24; 350:h1885.
  67. Developing a section labeler for clinical documents. AMIA Annu Symp Proc. 2014; 2014:636-44.
  68. Automatic identification of methotrexate-induced liver toxicity in patients with rheumatoid arthritis from the electronic medical record. J Am Med Inform Assoc. 2015 Apr; 22(e1):e151-61.
  69. Evaluating the state of the art in disorder recognition and normalization of the clinical narrative. J Am Med Inform Assoc. 2015 Jan; 22(1):143-54.
  70. Temporal Annotation in the Clinical Domain. Trans Assoc Comput Linguist. 2014 Apr; 2:143-154.
  71. Carrell et al. respond to "Observational research and the EHR". Am J Epidemiol. 2014 Mar 15; 179(6):762-3.
  72. Using natural language processing to improve efficiency of manual chart abstraction in research: the case of breast cancer recurrence. Am J Epidemiol. 2014 Mar 15; 179(6):749-58.
  73. Modeling disease severity in multiple sclerosis using electronic health records. PLoS One. 2013; 8(11):e78927.
  74. Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium. J Am Med Inform Assoc. 2013 Dec; 20(e2):e341-8.
  75. Discovering body site and severity modifiers in clinical texts. J Am Med Inform Assoc. 2014 May-Jun; 21(3):448-54.
  76. Improved de-identification of physician notes through integrative modeling of both public and private medical text. BMC Med Inform Decis Mak. 2013 Oct 02; 13:112.
  77. Automatic prediction of rheumatoid arthritis disease activity from the electronic medical records. PLoS One. 2013; 8(8):e69932.
  78. Normalization of plasma 25-hydroxy vitamin D is associated with reduced risk of surgery in Crohn's disease. Inflamm Bowel Dis. 2013 Aug; 19(9):1921-7.
  79. Improving case definition of Crohn's disease and ulcerative colitis in electronic medical records using natural language processing: a novel informatics approach. Inflamm Bowel Dis. 2013 Jun; 19(7):1411-20.
  80. Formative evaluation of ontology learning methods for entity discovery by using existing ontologies as reference standards. Methods Inf Med. 2013; 52(4):308-16.
  81. Towards comprehensive syntactic and semantic annotations of the clinical narrative. J Am Med Inform Assoc. 2013 Sep-Oct; 20(5):922-30.
  82. Similar risk of depression and anxiety following surgery or hospitalization for Crohn's disease and ulcerative colitis. Am J Gastroenterol. 2013 Apr; 108(4):594-601.
  83. Psychiatric co-morbidity is associated with increased risk of surgery in Crohn's disease. Aliment Pharmacol Ther. 2013 Feb; 37(4):445-54.
  84. A common type system for clinical natural language processing. J Biomed Semantics. 2013 Jan 03; 4(1):1.
  85. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J Am Med Inform Assoc. 2013 Jan 01; 20(1):84-94.
  86. Anaphoric reference in clinical reports: characteristics of an annotated corpus. J Biomed Inform. 2012 Jun; 45(3):507-21.
  87. Building a robust, scalable and standards-driven infrastructure for secondary use of EHR data: the SHARPn project. J Biomed Inform. 2012 Aug; 45(4):763-71.
  88. A system for coreference resolution for the clinical narrative. J Am Med Inform Assoc. 2012 Jul-Aug; 19(4):660-7.
  89. Automated discovery of drug treatment patterns for endocrine therapy of breast cancer within an electronic medical record. J Am Med Inform Assoc. 2012 Jun; 19(e1):e83-9.
  90. The MiPACQ clinical question answering system. AMIA Annu Symp Proc. 2011; 2011:171-80.
  91. The SHARPn project on secondary use of Electronic Medical Record data: progress, plans, and possibilities. AMIA Annu Symp Proc. 2011; 2011:248-56.
  92. Drug side effect extraction from clinical narratives of psychiatry and psychology patients. J Am Med Inform Assoc. 2011 Dec; 18 Suppl 1:i144-9.
  93. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. J Am Med Inform Assoc. 2011 Sep-Oct; 18(5):540-3.
  94. Coreference resolution: a review of general methodologies and applications in the clinical domain. J Biomed Inform. 2011 Dec; 44(6):1113-22.
  95. Anaphoric relations in the clinical narrative: corpus creation. J Am Med Inform Assoc. 2011 Jul-Aug; 18(4):459-65.
  96. The emerging role of electronic medical records in pharmacogenomics. Clin Pharmacol Ther. 2011 Mar; 89(3):379-86.
  97. Discovering peripheral arterial disease cases from radiology notes using natural language processing. AMIA Annu Symp Proc. 2010 Nov 13; 2010:722-6.
  98. Classification of medication status change in clinical narratives. AMIA Annu Symp Proc. 2010 Nov 13; 2010:762-6.
  99. CNTRO: A Semantic Web Ontology for Temporal Relation Inferencing in Clinical Narratives. AMIA Annu Symp Proc. 2010 Nov 13; 2010:787-91.
  100. Effectiveness of lexico-syntactic pattern matching for ontology enrichment with clinical documents. Methods Inf Med. 2011; 50(5):397-407.
  101. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010 Sep-Oct; 17(5):507-13.
  102. Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. J Am Med Inform Assoc. 2010 Sep-Oct; 17(5):568-74.
  103. The Rochester Epidemiology Project: exploiting the capabilities for population-based research in rheumatic diseases. Rheumatology (Oxford). 2011 Jan; 50(1):6-15.
  104. Towards temporal relation discovery from the clinical narrative. AMIA Annu Symp Proc. 2009 Nov 14; 2009:568-72.
  105. Mayo clinic smoking status classification system: extensions and improvements. AMIA Annu Symp Proc. 2009 Nov 14; 2009:619-23.
  106. Discerning tumor status from unstructured MRI reports--completeness of information in existing reports and utility of automated natural language processing. J Digit Imaging. 2010 Apr; 23(2):119-32.
  107. Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model. J Biomed Inform. 2009 Oct; 42(5):937-49.
  108. The first step toward data reuse: disambiguating concept representation of the locally developed ICU nursing flowsheets. Comput Inform Nurs. 2008 Sep-Oct; 26(5):282-9.
  109. Word sense disambiguation across two domains: biomedical literature and clinical notes. J Biomed Inform. 2008 Dec; 41(6):1088-100.
  110. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inform. 2008; 128-44.
  111. Mayo clinic NLP system for patient smoking status identification. J Am Med Inform Assoc. 2008 Jan-Feb; 15(1):25-8.
  112. Formalizing the International Classification of Functioning, Disability, and Health (ICF) using Formal Concept Analysis (FCA). AMIA Annu Symp Proc. 2007 Oct 11; 994.
  113. Toward near real-time acuity estimation: a feasibility study. Nurs Res. 2007 Jul-Aug; 56(4):288-94.
  114. Content coverage of SNOMED-CT toward the ICU nursing flowsheets and the acuity indicators. Stud Health Technol Inform. 2006; 122:722-6.
  115. Building and evaluating annotated corpora for medical NLP systems. AMIA Annu Symp Proc. 2006; 1050.
  116. Frame semantics and the domain of functioning, disability and health. AMIA Annu Symp Proc. 2005; 1106.
  117. A term extraction tool for expanding content in the domain of functioning, disability, and health: proof of concept. J Biomed Inform. 2003 Aug-Oct; 36(4-5):250-9.
  118. Testing the generalizability of the ISO model for nursing diagnoses. AMIA Annu Symp Proc. 2003; 274-8.
  119. A data-driven approach for extracting "the most specific term" for ontology development. AMIA Annu Symp Proc. 2003; 579-83.
