Original Article

Assessment of artificial intelligence chatbot generated patient information on head and neck surgery

Hyuk Jin Kwun1, Nick Lilic1, Omid Ahmadi1,2

1Department of Otolaryngology and Head and Neck Surgery, Auckland City Hospital, Auckland, New Zealand; 2Department of Surgery, School of Medicine, University of Auckland, Auckland, New Zealand

Contributions: (I) Conception and design: All authors; (II) Administrative support: O Ahmadi, N Lilic; (III) Provision of study materials or patients: HJ Kwun, O Ahmadi; (IV) Collection and assembly of data: HJ Kwun, O Ahmadi; (V) Data analysis and interpretation: HJ Kwun, O Ahmadi; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Omid Ahmadi, MBChB, PhD. Honorary Senior Lecturer, Department of Surgery, School of Medicine, University of Auckland, Private Bag 92019, Auckland 1142, New Zealand; Department of Otolaryngology and Head and Neck Surgery, Auckland City Hospital, Auckland, New Zealand. Email: OmidA@adhb.govt.nz.

Background: Artificial intelligence (AI) chatbots are increasingly used by patients to obtain medical information, yet the readability and reliability of their content remain uncertain. Previous studies show that online surgical resources often exceed recommended health literacy levels. This study evaluates the readability and reliability of health information generated by AI chatbots for common head and neck surgical procedures.

Methods: Five widely used chatbots—ChatGPT, Character AI, Google Gemini, You.com, and Perplexity AI—were queried for pre-operative patient information on total thyroidectomy, neck dissection, and parotidectomy. Readability was assessed using four validated indices: Flesch-Kincaid Grade Level, Flesch Reading Ease, Gunning Fog Index, and Simple Measure of Gobbledygook Index. Reliability was evaluated using the DISCERN instrument, a standardized tool for assessing the quality of written patient health information.

Results: AI-generated responses were consistently classified as “difficult” to read, with a mean Flesch-Kincaid Grade Level of 12.0 [95% confidence interval (CI): 11.6–12.3]. Google Gemini produced the most readable text, while ChatGPT had the lowest readability scores. Reliability assessment revealed an average DISCERN score of 45.5 (95% CI: 44.4–46.6), corresponding to “fair” quality. Notably, You.com and Perplexity AI scored highest (>50), largely due to their inclusion of source referencing.

Conclusions: AI chatbots provide readily accessible but variable-quality health information for head and neck surgery. While their reliability is moderate, readability remains a significant barrier. Enhancements in plain language use, transparency, and source referencing are required before these tools can function as high-quality patient education resources.

Keywords: Artificial intelligence (AI); comprehension; health literacy; data quality; reading


Received: 09 June 2025; Accepted: 28 October 2025; Published online: 16 January 2026.

doi: 10.21037/ajo-25-43


Introduction

With the exponential increase in internet access, over 80% of patients now turn to the internet for health-related information (1-3). As many as 98% of parents report using the internet to find health information for their children (4). This shift is partly driven by the convenience of online access and the perceived confidentiality and anonymity it provides (5). The emergence of artificial intelligence (AI) chatbots marks a new platform for healthcare information delivery, offering patients an accessible, one-stop source for asking questions. AI chatbots such as ChatGPT, which alone saw 14 billion visits from November 2022 to August 2023 (6), are increasingly popular for medical inquiries. Some studies suggest chatbot-provided preoperative information can rival that of physicians (7).

However, the internet remains an unregulated resource, making it difficult to ensure the readability and reliability of the information patients encounter (8-11). Previous studies evaluating online patient information websites have consistently found these resources to be inadequate in both readability and reliability (12-15). There are also concerns about the reliability and safety of generative AI in medical contexts (16).

While earlier research has focused on the readability and reliability of traditional patient information websites (12-15), the quality of AI-generated content in these areas remains unclear. To date, no studies have evaluated chatbot-generated information in the field of head and neck surgery. The aim of this study was therefore to assess the readability and reliability of AI chatbot-generated information related to common head and neck surgeries.


Methods

Chatbot selection

A recent online study identified the top 50 most visited AI tools, collectively accounting for over 80% of the AI industry’s traffic between September 2022 and August 2023 (6). From this list, only the AI chatbot category was analysed. Chatbots designed solely for casual conversation or requiring payment were excluded, as the focus was on free-to-use chatbots, which are accessible to the widest range of users. The study is reported according to the STROBE reporting checklist (available at https://www.theajo.com/article/view/10.21037/ajo-25-43/rc).

Searched questions

To obtain patient-centred answers relevant to those preparing for surgery, each AI chatbot was asked: “What should I know about before a total thyroidectomy”, “Neck dissection”, or “Parotidectomy”. All searches were conducted in English in July 2024. To minimise the influence of search location and prior queries, a private browsing window was used, cookies were cleared, and the page was refreshed between each search. A total of 15 responses were generated and subsequently copied into Microsoft Word 2021 (Microsoft Corporation, USA) for further analysis.

Assessment of readability

To assess the readability of each answer, four commonly used and validated readability assessment tools were applied: the Flesch-Kincaid Grade (FKG) level, the Flesch Reading Ease (FRE) score, the Simple Measure of Gobbledygook (SMOG) Index, and the Gunning Fog Index (GFI). The FKG level corresponds to the United States reading grade level, with a higher FKG level indicating a more difficult text (17). The FRE score ranges from 0 to 100, with a higher score indicating an easier-to-read text (17). Microsoft Word 2021 (Microsoft Corporation, USA) was used to obtain the FKG and FRE scores. The SMOG Index measures text readability by analysing three ten-sentence samples and counting polysyllabic words, and estimates the United States reading grade level required to understand the text (18). The GFI estimates the number of years of formal education (range, 0 to 20) required to read and comprehend the information on the first reading (19). A GFI of six is suitable for sixth graders, around eight for general public readability, and above 17 for graduate-level comprehension. Table 1 summarizes this information. The SMOG Index and GFI scores were obtained using the online platform Readable (https://readable.com/), a widely recognized tool for evaluating text readability (20).
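
In this study, the FKG and FRE scores were taken from Microsoft Word and the SMOG and GFI scores from Readable; purely for illustration, the published formulas behind these four indices can be sketched in Python as below. The syllable counter here is a deliberately naive heuristic (an assumption of this sketch, not part of the study’s method), so its output will not exactly match commercial tools:

```python
import re

def count_syllables(word: str) -> int:
    # Crude vowel-group heuristic; commercial tools use syllable
    # dictionaries, so their counts (and scores) will differ.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syl = sum(count_syllables(w) for w in words)
    # Polysyllabic ("complex") words: three or more syllables.
    poly = sum(1 for w in words if count_syllables(w) >= 3)
    W, S = len(words), len(sentences)
    return {
        # Flesch-Kincaid Grade: US school grade level (17).
        "FKG": 0.39 * W / S + 11.8 * syl / W - 15.59,
        # Flesch Reading Ease: 0-100, higher = easier (17).
        "FRE": 206.835 - 1.015 * W / S - 84.6 * syl / W,
        # SMOG: grade level from polysyllabic word density (18).
        "SMOG": 1.0430 * (poly * 30 / S) ** 0.5 + 3.1291,
        # Gunning Fog: years of formal education required (19).
        "GFI": 0.4 * (W / S + 100 * poly / W),
    }
```

Sentence splitting, word tokenisation, and syllable estimation are where implementations differ most, which is why the same passage can receive slightly different scores across tools.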

Table 1

Comparison of four readability metric scores across included AI chatbots and common head and neck surgical procedures

AI chatbot Total thyroidectomy Neck dissection Parotidectomy Mean (SD) 95% CI P value Interpretation
FKG
   ChatGPT 15.2 12.8 14.8 14.3 (1.3) 12.8–15.7 0.016
   Character AI 14.0 11.7 13.2 13.0 (1.2) 11.6–14.3 0.037
   Google Gemini 10.1 7.4 10.5 9.3 (1.7) 7.4–11.2 Ref
   You.com 11.8 12.2 12.2 12.1 (0.2) 11.8–12.3 0.050
   Perplexity AI 12.0 12.2 9.8 11.3 (1.3) 9.8–12.8 0.182
   Mean 12.6 11.3 12.1 12.0 (0.7) 11.6–12.3 0.592
FRE
   ChatGPT 24.1 31.3 22.3 25.9 (4.8) 20.5–31.3 0.021 Very difficult
   Character AI 32.2 46.2 34.8 37.7 (7.4) 29.3–46.2 0.121 Difficult
   Google Gemini 49.0 68.3 45.1 54.1 (12.4) 40.1–68.2 Ref Difficult
   You.com 34.5 41.7 27.2 34.5 (7.3) 26.3–42.7 0.077 Difficult
   Perplexity AI 37.2 40.9 46.0 41.4 (4.4) 36.4–46.4 0.169 Difficult
   Mean 35.4 45.7 35.1 38.7 (11.6) 32.8–44.6 0.278 Difficult
SMOG Index
   ChatGPT 16.8 16.4 16.3 16.5 (0.3) 16.2–16.8 <0.001
   Character AI 15.2 14.9 15.1 15.1 (0.2) 14.9–15.2 0.005
   Google Gemini 13.4 12.0 12.9 12.7 (0.7) 12.0–13.5 Ref
   You.com 16.0 15.4 14.3 15.3 (0.9) 14.3–16.2 0.017
   Perplexity AI 14.6 17.0 12.9 14.8 (2.1) 12.5–17.2 0.169
   Mean 15.2 15.1 14.3 14.9 (1.5) 14.1–15.7 0.636
GFI
   ChatGPT 19.4 18.7 18.5 18.9 (0.5) 18.3–19.4 0.001 Post-graduate
   Character AI 16.9 16.4 16.8 16.7 (0.2) 16.4–17.0 0.005 College
   Google Gemini 14.3 12.4 13.7 13.5 (1.0) 12.4–14.6 Ref College
   You.com 18.2 17.2 16.0 17.1 (1.1) 15.8–18.4 0.013 College
   Perplexity AI 16.2 19.5 14.2 16.6 (2.7) 13.6–19.7 0.129 College
   Mean 17.0 16.8 15.8 16.6 (2.1) 15.5–17.6 0.672 College

P values in the Mean rows are from one-way ANOVA comparing mean readability scores between common head and neck surgical procedures; interpretation given for FRE and GFI scores (12). AI, artificial intelligence; ANOVA, analysis of variance; CI, confidence interval; FKG, Flesch-Kincaid Grade; FRE, Flesch Reading Ease; GFI, Gunning Fog Index; SD, standard deviation; SMOG, Simple Measure of Gobbledygook.

Assessment of reliability

The reliability of each answer was assessed using the DISCERN instrument (21), a validated tool comprising 16 questions, each rated on a Likert scale from one to five (see Appendix 1). A higher score indicates better-quality health information. Two authors (H.J.K. and O.A.) independently evaluated the information provided by each chatbot, applying the DISCERN criteria for each procedure. Discrepancies were discussed between the two authors; if no agreement could be reached, the issue was escalated to a third author (N.L.), who independently applied the DISCERN tool and provided the final DISCERN score.
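
As a minimal sketch of how this two-rater workflow could be formalised, the helper below totals two independent sets of 16 DISCERN item ratings and flags disagreeing items for discussion or third-rater adjudication. The `tolerance` parameter is hypothetical; in the study itself, any discrepancy was discussed:

```python
def reconcile_discern(rater_a: list[int], rater_b: list[int],
                      tolerance: int = 0) -> tuple[float, list[int]]:
    """Combine two raters' 16-item DISCERN ratings (each item 1-5).

    Returns the consensus total score (mean of the two raters'
    totals, on DISCERN's 16-80 scale) and the 1-based indices of
    items disagreeing by more than `tolerance` points, which would
    be referred for discussion or third-rater adjudication.
    """
    assert len(rater_a) == len(rater_b) == 16, "DISCERN has 16 items"
    assert all(1 <= s <= 5 for s in rater_a + rater_b)
    total = (sum(rater_a) + sum(rater_b)) / 2
    flagged = [i + 1 for i, (a, b) in enumerate(zip(rater_a, rater_b))
               if abs(a - b) > tolerance]
    return total, flagged
```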

Statistical analysis

The DISCERN score and readability metrics (FKG, FRE, SMOG Index, and GFI) were documented in Microsoft Excel for subsequent analysis. The mean DISCERN and readability scores (with standard deviations) for each surgery type and each chatbot were compared using one-way analysis of variance (ANOVA) and Student’s t-test. Statistical significance was defined as a p-value of less than 0.05.
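
The analysis was performed in Excel; an equivalent computation in Python with SciPy (a substitution for illustration, not the authors’ workflow) might look like the following, using the FKG values reported in Table 1:

```python
from scipy import stats  # assumes SciPy is installed

# FKG scores from Table 1: one list per chatbot, one value per
# procedure (thyroidectomy, neck dissection, parotidectomy).
fkg = {
    "ChatGPT":       [15.2, 12.8, 14.8],
    "Character AI":  [14.0, 11.7, 13.2],
    "Google Gemini": [10.1, 7.4, 10.5],
    "You.com":       [11.8, 12.2, 12.2],
    "Perplexity AI": [12.0, 12.2, 9.8],
}

# One-way ANOVA: do mean FKG scores differ between chatbots?
f_stat, p_anova = stats.f_oneway(*fkg.values())

# Student's t-test for one pairwise contrast, e.g. the reference
# chatbot (Google Gemini) against ChatGPT.
t_stat, p_t = stats.ttest_ind(fkg["Google Gemini"], fkg["ChatGPT"])

print(f"ANOVA across chatbots: F={f_stat:.2f}, P={p_anova:.4f}")
print(f"Gemini vs ChatGPT:     t={t_stat:.2f}, P={p_t:.4f}")
```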


Results

Amongst the top 50 most visited AI tools, eight AI chatbots were identified. Three were excluded (two were primarily used for casual conversation, and the third was a paid chatbot), leaving five AI chatbots in the study. In order of popularity, these were ChatGPT 3.5, Character AI (web interface version, accessed in July 2024), Google Gemini 1.5 Flash, You.com (version 2.0), and Perplexity AI (web interface version, accessed in July 2024).

Readability

The mean FKG score was 12.0 [95% confidence interval (CI): 11.6–12.3], indicating that the text was pitched at a reading level of grade 12 or higher. The readability scores ranged from difficult to very difficult, corresponding to a college or post-graduate education level required to read and comprehend the passage on first attempt. The readability scores are outlined in Table 1. Google Gemini was the easiest chatbot to read across all four readability metrics, while ChatGPT had the most difficult readability scores. The difference in mean scores between Google Gemini and ChatGPT was statistically significant for all four readability metrics. However, when readability scores were compared between surgical procedures, no statistically significant difference was observed on any of the four metrics, as shown in Table 1.

Reliability

The mean DISCERN score across all chatbots was 45.5 (95% CI: 44.4–46.6), indicating a “fair” quality of health information, with scores ranging from 32 to 58. There was a wide range of results, as shown in Table 2, with You.com and Perplexity AI standing out with mean DISCERN scores of 54.7 and 55.0, respectively, reflecting “good” quality information. This is largely due to their use of footnotes and references to information sources. Character AI had the lowest score, averaging 33.7, which corresponds to “poor” quality information. The differences in mean DISCERN scores between AI chatbots were statistically significant (as shown in Table 2), whereas the difference between surgical procedures was not (P=0.948).

Table 2

Comparison of DISCERN scores across various AI chatbots and surgical procedures

AI chatbot Total thyroidectomy Neck dissection Parotidectomy Mean (SD) 95% CI P value Interpretation
ChatGPT 46 39 42 42.3 (3.5) 38.4–46.3 0.009 Fair
Character AI 35 34 32 33.7 (1.5) 31.9–35.4 <0.001 Poor
Google Gemini 39 44 42 41.7 (2.5) 38.8–44.5 0.004 Fair
You.com 55 52 57 54.7 (2.5) 51.8–57.5 0.890 Good
Perplexity AI 58 55 52 55.0 (3.0) 51.6–58.4 Ref Good
Mean 46.6 44.8 45.0 45.5 (1.0) 44.4–46.6 0.948 Fair

The P value in the Mean row is from one-way ANOVA comparing mean DISCERN scores between common head and neck surgical procedures. AI, artificial intelligence; ANOVA, analysis of variance; CI, confidence interval; SD, standard deviation.


Discussion

To the authors’ knowledge, this is the first study to evaluate the readability and reliability of AI chatbots in the field of head and neck surgery. We found that the AI chatbots provided health information of overall “fair” quality. However, the readability of all AI chatbots far exceeded the levels recommended for patient health information.

This study found that most of the AI chatbot information analysed was difficult to read, as shown in Table 1. The mean FKG score was 12.0 (95% CI: 11.6–12.3), indicating that a reader would need at least a 12th-grade education (equivalent to the final year of secondary school in New Zealand or Australia) to read and comprehend the text on the first attempt. Health literacy relies on an individual’s ability to read, understand, and apply information to make informed decisions, give consent, and follow treatment instructions. Even high-quality patient information fails to communicate effectively if it is not accessible to the majority of its target audience. The American Medical Association (AMA) and the National Institutes of Health (NIH) recommend that patient materials be written at or below a sixth-grade reading level to maximise accessibility (22). However, studies show that many health materials exceed this standard (13-15). For example, Lee et al. found that patient education materials from websites on jaw and orthognathic surgery had a mean FKG readability level of 11.6, with a range of 6 to 16.8 (23). Similarly, Storino et al. reported that pancreatic cancer information on 20 academic hospital websites had a median readability score of 14.5 across nine standardized scoring systems (24). We also analysed the readability of the Patient Information Leaflets from the ENT UK website for the three head and neck surgeries included in our study (25). These leaflets had superior readability, with a mean FKG score of 9.09 and a mean FRE score of 64.5. Readability was comparable in Australian patient information brochures, with a mean FKG score of 9.6 and FRE score of 56.8 (26).

Unlike websites and pamphlets, which are often created by many different authors, the content provided by AI chatbots is generated by a relatively small number of software systems. In our study, Google Gemini demonstrated the highest readability among chatbots, while ChatGPT had the lowest. When we compared readability scores by surgical procedure (as opposed to by AI chatbot), we found no difference, leading us to conclude that it is the chatbot, rather than the surgical procedure, that determines the readability of the content. There are several straightforward ways to improve readability. These include the provision of illustrations, which can overcome language and numeracy barriers, and offering patients a choice between “simple” and “comprehensive” information, so they can select the level of complexity they wish to read.

Reliability is a critical factor in evaluating medical information, given its influence on healthcare decisions and the risks that inaccurate information can pose (12-15). Previous studies on traditional online patient information have highlighted similar challenges. For instance, a 2021 study by Ahsanuddin et al. reviewed the top 42 Google search results for “nose job” and found a mean DISCERN score of 2.6 (±0.7) per item, equivalent to 41.6 out of 80, indicating relatively low reliability (27). Similarly, Lee et al. assessed websites related to orthognathic and jaw surgery, finding a mean DISCERN score of 25.5 out of 80, with only two sites scoring better than 50 (23). By contrast, the Patient Information Leaflets from the ENT UK website scored considerably higher in our analysis, with an average DISCERN score of 65 out of 80.

Our study demonstrated higher DISCERN scores for chatbot-generated content, with a mean of 45.5 (95% CI: 44.4–46.6) out of 80. However, the lack of citations in three of the five chatbots contributed to lower reliability scores, whereas You.com and Perplexity AI achieved DISCERN scores of 54.7 and 55.0, respectively, owing to their inclusion of citations. Similar citation limitations were identified by Hurley and colleagues in their DISCERN analysis of ChatGPT-generated information on shoulder stabilisation surgery (28).

The DISCERN instrument was developed to assess the quality of written patient information more than 20 years ago, well before the mainstream use of AI chatbots (29). Given the widespread use of AI chatbots, it may be time to develop a dedicated AI chatbot reliability tool, or to modify an existing one. A reliability tool tailored to AI-generated information could offer a more accurate framework for assessing, evaluating, and therefore improving this emerging information source. An important consideration for any reliability tool assessing AI chatbots is the potential influence of stakeholders, whether direct or indirect. For instance, AI chatbots could be directly funded to present favourable information about a specific product or unfavourable information about a competitor’s product. Indirect influence may also occur, such as when a product gains significant internet visibility through extensive advertising, promotional efforts, or funding of favourable research; this improved visibility may, in turn, shape the chatbot’s responses.

Limitations

This study has several limitations. It analysed the top five free-to-use AI chatbots, so inclusion of other less-used or paid chatbots may have altered the results. One significant drawback of AI chatbot software is that the source of information is often unclear when no citations are provided. Chatbot responses often draw on expert-level, open-access material rather than patient-focused resources, which may explain the high reading levels observed. Because the AI chatbots’ responses are based on information available on the internet, these responses will change over time as the underlying information changes. A further limitation is that each chatbot was queried only once, and responses may vary with repeated prompts, even on the same day; as a result, our findings may not be fully reproducible. Rapid platform evolution also means outputs are likely to change over time, and future studies should consider repeated sampling to account for temporal variability.

In addition, despite two authors assigning DISCERN scores independently, it remains a subjective scoring system. Furthermore, as previously discussed, the DISCERN tool was designed for evaluation of written information (e.g., websites and patient pamphlets) and therefore it has drawbacks when used to assess AI chatbots.

AI chatbots also raise important ethical concerns, including bias, transparency, and conflict of interest. Owing to their lack of contextual understanding, AI chatbots can produce results based on sources that may be funded, distributed, or influenced by stakeholders in the field. This is further compounded when chatbot responses do not state the references from which the information was sourced.

Regarding readability, we utilised four tools to evaluate it; however, none of these formulas directly assesses comprehension (30). Although developed in the United States and not perfectly aligned with Australian or New Zealand grading systems, these readability indices remain the most widely adopted international benchmarks and enable meaningful comparison with prior studies of online health information. The authors recognise that numerous factors influence a patient’s ability to read and understand health information, including formal education, socioeconomic status, language barriers, intellectual ability, and cultural beliefs.


Conclusions

The rising popularity of AI chatbots marks a new frontier in the provision of, and access to, healthcare information. The results of this study show that there is room for improvement in both the readability and the reliability of the information AI chatbots provide. With improvements to the chatbots themselves and the development of new, dedicated reliability tools, there is an opportunity to provide patients with readable and reliable information, helping them make better informed decisions about their health.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the STROBE reporting checklist. Available at https://www.theajo.com/article/view/10.21037/ajo-25-43/rc

Peer Review File: Available at https://www.theajo.com/article/view/10.21037/ajo-25-43/prf

Data Sharing Statement: Available at https://www.theajo.com/article/view/10.21037/ajo-25-43/dss

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://www.theajo.com/article/view/10.21037/ajo-25-43/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. InternetNZ. The State of the Internet in New Zealand. Available online: https://internetnz.nz/assets/Archives/State-of-the-Internet-2017.pdf
  2. Fox S. The social life of health information 2011. Available online: https://www.pewresearch.org/2011/05/12/the-social-life-of-health-information-2011/
  3. Fox S. Health Topics 2011. Available online: https://www.pewresearch.org/internet/2011/02/01/health-topics-3/
  4. Pehora C, Gajaria N, Stoute M, et al. Are Parents Getting it Right? A Survey of Parents' Internet Use for Children's Health Care Information. Interact J Med Res 2015;4:e12. [Crossref] [PubMed]
  5. Powell J, Inglis N, Ronnie J, et al. The characteristics and motivations of online health information seekers: cross-sectional survey and qualitative interview study. J Med Internet Res 2011;13:e20. [Crossref] [PubMed]
  6. Sarkar S. AI Industry Analysis: 50 Most Visited AI Tools and Their 24B+ Traffic Behavior. Writerbuddy 2024. Available online: https://writerbuddy.ai/blog/ai-industry-analysis
  7. Durairaj KK, Baker O, Bertossi D, et al. Artificial Intelligence Versus Expert Plastic Surgeon: Comparative Study Shows ChatGPT “Wins” Rhinoplasty Consultations: Should We Be Worried? Facial Plast Surg Aesthet Med 2024;26:270-5. [Crossref] [PubMed]
  8. Fahy E, Hardikar R, Fox A, et al. Quality of patient health information on the Internet: reviewing a complex and evolving landscape. Australas Med J 2014;7:24-8. [Crossref] [PubMed]
  9. Corcelles R, Daigle CR, Talamas HR, et al. Assessment of the quality of Internet information on sleeve gastrectomy. Surg Obes Relat Dis 2015;11:539-44. [Crossref] [PubMed]
  10. Haymes AT. The Quality of Rhinoplasty Health Information on the Internet. Ann Plast Surg 2016;76:143-9. [Crossref] [PubMed]
  11. Alsaiari A, Joury A, Aljuaid M, et al. The Content and Quality of Health Information on the Internet for Patients and Families on Adult Kidney Cancer. J Cancer Educ 2017;32:878-84. [Crossref] [PubMed]
  12. Ahmadi O, Louw J, Leinonen H, et al. Glioblastoma: assessment of the readability and reliability of online information. Br J Neurosurg 2021;35:551-4. [Crossref] [PubMed]
  13. Ahmadi O, Wood AJ. The readability and reliability of online information about adenoidectomy. J Laryngol Otol 2021;135:976-80. [Crossref] [PubMed]
  14. Patel CB, Kerr N, Ahmadi O, et al. Evaluation of readability and reliability of online patient information for intracranial aneurysms. ANZ J Surg 2022;92:843-7. [Crossref] [PubMed]
  15. Heaven CL, Patel C, Ahmadi O, et al. Readability, reliability and credibility of online patient information on skin grafts. Australas J Dermatol 2023;64:e57-64. [Crossref] [PubMed]
  16. Stokel-Walker C, Van Noorden R. What ChatGPT and generative AI mean for science. Nature 2023;614:214-6. [Crossref] [PubMed]
  17. Flesch R. A new readability yardstick. J Appl Psychol 1948;32:221-33. [Crossref] [PubMed]
  18. Mc Laughlin GH. SMOG grading-a new readability formula. J Reading 1969;12:639-46.
  19. Gunning R. The technique of clear writing. McGraw-Hill; 1952.
  20. Readable. Reviews. 2024 Dec 5. Available online: https://readable.com/reviews/
  21. Charnock D, Shepperd S, Needham G, et al. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health 1999;53:105-11. [Crossref] [PubMed]
  22. Weiss BD. Health literacy and patient safety: Help patients understand. Manual for Clinicians. 2nd edition. AMA Foundation; 2007.
  23. Lee KC, Berg ET, Jazayeri HE, et al. Online Patient Education Materials for Orthognathic Surgery Fail to Meet Readability and Quality Standards. J Oral Maxillofac Surg 2019;77:180.e1-8. [Crossref] [PubMed]
  24. Storino A, Castillo-Angeles M, Watkins AA, et al. Assessing the Accuracy and Readability of Online Health Information for Patients With Pancreatic Cancer. JAMA Surg 2016;151:831-7. [Crossref] [PubMed]
  25. Patient information leaflets [Internet]. London: ENT UK. [cited 2025 Feb 7]. Available online: https://www.entuk.org/professionals/patient_information_leaflets.aspx
  26. Patient brochures [Internet]. Sydney: Head and Neck Cancer Australia. [cited 2025 Feb 27]. Available online: https://support.headandneckcancer.org.au/store/products/21/patient-brochures
  27. Ahsanuddin S, Cadwell JB, Povolotskiy R, et al. Quality, Reliability, and Readability of Online Information on Rhinoplasty. J Craniofac Surg 2021;32:2019-23. [Crossref] [PubMed]
  28. Hurley ET, Crook BS, Lorentz SG, et al. Evaluation High-Quality of Information from ChatGPT (Artificial Intelligence-Large Language Model) Artificial Intelligence on Shoulder Stabilization Surgery. Arthroscopy 2024;40:726-731.e6. [Crossref] [PubMed]
  29. Rees CE, Ford JE, Sheard CE. Evaluating the reliability of DISCERN: a tool for assessing the quality of written patient information on treatment choices. Patient Educ Couns 2002;47:273-5. [Crossref] [PubMed]
  30. Kauchak D, Leroy G. Moving Beyond Readability Metrics for Health-Related Text Simplification. IT Prof 2016;18:45-51. [Crossref] [PubMed]
doi: 10.21037/ajo-25-43
Cite this article as: Kwun HJ, Lilic N, Ahmadi O. Assessment of artificial intelligence chatbot generated patient information on head and neck surgery. Aust J Otolaryngol 2026;9:8.
