Paediatric tonsillitis: evaluation of artificial intelligence generated responses to commonly asked questions

Surya Singh; Femi E. Ayeni; Niranjan Sritharan; Faruque Riffat; Anand Suruliraj; Suchitra Paramaesvaran

doi:10.21037/ajo-25-48

Original Article

Surya Singh¹, Femi E. Ayeni², Niranjan Sritharan¹, Faruque Riffat¹, Anand Suruliraj¹, Suchitra Paramaesvaran¹

¹Division of Otolaryngology and Head & Neck Surgery, Department of Surgery, Nepean Hospital, Sydney, NSW, Australia; ²Nepean Institute of Academic Surgery, Nepean Hospital Clinical School, Department of Surgery, University of Sydney, NSW, Australia

Contributions: (I) Conception and design: S Singh, S Paramaesvaran; (II) Administrative support: S Paramaesvaran, FE Ayeni; (III) Provision of study materials or patients: S Singh; (IV) Collection and assembly of data: N Sritharan, F Riffat, A Suruliraj, S Paramaesvaran; (V) Data analysis and interpretation: S Singh, S Paramaesvaran, FE Ayeni; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Dr. Surya Singh, BSc, MBChB. Division of Otolaryngology and Head & Neck Surgery, Department of Surgery, Nepean Hospital, Somerset St, Kingswood, Sydney, NSW 2747, Australia. Email: suryausingh@gmail.com.

Background: Artificial intelligence (AI) is becoming increasingly prevalent as a medical tool for patients and families. Research is needed to investigate the accuracy of responses to medical questions and vignettes. This study aims to investigate the performance of Chat Generative Pre-trained Transformer (ChatGPT) and Google’s Gemini in paediatric tonsillitis vignettes, with direct comparison to recognised international guidelines and patient-centred online resources for accuracy and quality.

Methods: Five commonly asked questions about paediatric tonsillitis were selected by four Ear, Nose and Throat (ENT) surgeons with over 40 years of combined experience. These questions were posed to ChatGPT and Gemini. Patient resources and medical guidelines were also searched for relevant answers to these questions. Four assessors collected sufficient data to achieve over 80% power to detect a medium effect size (f=0.25). Answers were ranked using the Appropriateness of Preclinical Measures (APM) score, an established grading tool for AI generated responses. Readability scores were calculated for each answer to determine ease of interpretation using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL).

Results: Both AI models answered with high accuracy. Mean APM scores for ChatGPT (no prompt) and Gemini were 3.90±0.31 and 3.15±1.63, respectively, out of a maximum of 4.0. Analysis showed neither was superior (P=0.33). When ChatGPT was prompted to act as an experienced ENT surgeon, the mean APM score increased to 4.00±0.00. This score was significantly higher than those of the American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) Tonsillectomy in Children guidelines and the Royal Australian College of General Practitioners (RACGP) tonsillitis guidelines [both with equivalent mean APM scores 3.10±1.62; 95% confidence interval (CI): 2.34–3.86; P<0.02]. No statistical significance was found when compared to ENT UK guidelines (mean APM score 3.80±0.52; P=0.23). There was no significant difference between the two versions of ChatGPT (P=0.99). Readability varied across groups [FRES 46.3–61.8 (±3.50–13.5)]. Patient-facing resources were most readable (FRES ≈61; FKGL 8.00–8.10; grade 8–9 level). The AAO-HNS guidelines showed the lowest readability (FRES 46.3±9.29; FKGL ≈10.8). All AI outputs exceeded recommended complexity levels.

Conclusions: This study demonstrates the potential for AI to aid the public. These results suggest that for basic patient-centred questions, AI systems may be equivalent to guidelines or online resources. However, responses may lack the depth required for thorough understanding, and quality may be affected by user input. AI shows great promise in aiding medical education, triage and pre-hospital management. However, the quality and accuracy of AI answers need improvement before it can be trusted as a primary information resource for patients. We continue to recommend consultation with general practitioners for concerns of tonsillitis, and review with an ENT surgeon if indicated.

Keywords: Otorhinolaryngology; Chat Generative Pre-trained Transformer (ChatGPT); artificial intelligence (AI); paediatric; patient education

Received: 17 June 2025; Accepted: 08 January 2026; Published online: 16 March 2026.

Introduction

Paediatric tonsillitis is one of the most common presentations that is treated by general practitioners and Ear, Nose and Throat (ENT) surgeons. In 2002, it was estimated that paediatric tonsillitis represented 3.7% of all consultations in Australia (1). It can cause significant anxiety for caregivers or disruption to families, with potential time off work to attend appointments or need for hospitalisation. As a result, parents may find it more effective to access resources on the internet for education or information regarding timely presentation (2). It is estimated that 80% of Australians search online for health information, and 40% seek self-treatment advice (3). It is particularly important to consider the variability in ENT-focused publicly available patient-centred health resources and evaluate these for accuracy and relevance.

The availability and use of artificial intelligence (AI) has flourished, with an increasing number of publicly available applications. Two examples include OpenAI’s Chat Generative Pre-trained Transformer (ChatGPT) and Google’s Gemini (previously Bard), each with over 100 million monthly users (4-6). ChatGPT has demonstrated almost human-level performance on medical examinations such as the United States Medical Licensing Exam (USMLE), scoring near the passing mark of 60% (7). Additionally, there is growing evidence demonstrating the capabilities of AI in more clinical settings, with AI models performing strongly in clinical exams for pathology, neurosurgery and radiology (8-10).

Advancement in the capabilities of AI has the potential to revolutionise health care. Patients can use AI to provide personalised support in a more interactive manner that increases trust (11). AI can empower patients to take control of their health through aiding with simple diagnoses, suggesting over-the-counter medication or avoiding unnecessary hospital visits (12). With primary care access becoming a growing issue in Australia, this is especially important in rural and remote populations (13).

However, it is important to acknowledge that the use of AI-powered tools may pose issues, as responses can contain information that is overly generalised, inaccurate, or false (14). This could lead to missed diagnoses and risks significant morbidity and mortality. Search engines such as Google are still the most popular tools for the public to search for information because they are easy to use and allow sources to be vetted (15). In the field of ENT surgery, in-depth publicly intelligible resources are limited. This may be due to the broad nature of the specialty as well as a lack of consensus amongst international guidelines. It is critical to evaluate these AI tools to ensure advice is evidence based, promoting patient safety and optimal health outcomes (8,16). Data that specifically explore the role of AI in ENT surgery remain limited. This study aims to investigate the accuracy of information provided by AI in response to a paediatric tonsillitis vignette and compare the results against established validated guidelines.

Methods

The study is reported according to the STROBE reporting guidelines (available at https://www.theajo.com/article/view/10.21037/ajo-25-48/rc). Ethical approval was assessed as not required given this study did not involve human participation, and the data collected was freely available from a public domain.

Question design

Five questions commonly asked by parents/caregivers of children with tonsillitis (up to age 18 years) were curated. The questions were informed by the clinical experience of four practising ENT surgeons. Each surgeon was a Fellow of the Royal Australasian College of Surgeons (FRACS) and had practised for more than ten years. Four assessors were required to produce a sufficient data set to generate >80% power for a medium effect size (f=0.25).

Each AI model was given a prompt (Table 1) and asked five clinical questions (Table 2). Responses were recorded and the interaction repeated three times to ensure consistency of response. Browser cookies and history were cleared between each repeat session to negate the effect of carryover memory. ChatGPT was then prompted to act as a specialist ENT surgeon prior to answering the questions. This was done to determine whether the quality of response varied when a specialised role was assigned. ChatGPT 3.5 (default free version) was utilised between 23rd and 24th August 2024 and Google Gemini on 28th August 2024.

Table 1

Prompts given to AI models prior to questions being asked

AI tool	Prompt
ChatGPT/Gemini	“Please answer the following questions about paediatric tonsillitis (age ≤18)”
ChatGPT (ENT surgeon)	“Act as an ENT surgeon with over 10 years of experience. Answer the following questions about paediatric tonsillitis (age ≤18)”

AI, artificial intelligence; ENT, Ear, Nose and Throat.

Table 2

Summary of the questions posed to both AI models

No.	Question
1	What is the definition of tonsillitis?
2	Define and explain the differences between tonsillitis and pharyngitis
3	Do either tonsillitis or pharyngitis need treatment with antibiotics?
4	What are the red flags that require urgent care?
5	When are children referred to ENT specialists and are there any modifying criteria?

AI, artificial intelligence; ENT, Ear, Nose and Throat.
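The querying protocol in Tables 1 and 2 can be laid out as a small data structure. This is a minimal sketch only: the study used the public web interfaces rather than code, the names `PROMPTS`, `QUESTIONS`, `REPEATS` and `build_sessions` are hypothetical, and applying three repeats to every arm (including the ENT surgeon role) is an assumption.

```python
from itertools import product

# Prompts copied from Table 1; the arm labels follow the naming used in the Results.
GENERIC = "Please answer the following questions about paediatric tonsillitis (age ≤18)"
PROMPTS = {
    "ChatGPT (no prompt)": GENERIC,
    "Google Gemini": GENERIC,
    "ChatGPT (ENT surgeon)": (
        "Act as an ENT surgeon with over 10 years of experience. "
        "Answer the following questions about paediatric tonsillitis (age ≤18)"
    ),
}

# Questions copied from Table 2.
QUESTIONS = [
    "What is the definition of tonsillitis?",
    "Define and explain the differences between tonsillitis and pharyngitis",
    "Do either tonsillitis or pharyngitis need treatment with antibiotics?",
    "What are the red flags that require urgent care?",
    "When are children referred to ENT specialists and are there any modifying criteria?",
]

REPEATS = 3  # assumption: every arm was repeated three times


def build_sessions():
    """Return one fresh session per (arm, repeat): the prompt followed by the five questions."""
    sessions = []
    for (arm, prompt), repeat in product(PROMPTS.items(), range(1, REPEATS + 1)):
        sessions.append({"arm": arm, "repeat": repeat, "messages": [prompt, *QUESTIONS]})
    return sessions


if __name__ == "__main__":
    for session in build_sessions():
        print(session["arm"], session["repeat"], len(session["messages"]))
```

Each entry corresponds to one clean browser session, which mirrors the clearing of cookies and history between repeats.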

Information sourcing

Responses to the same questions were collected from reputable internet-based clinical sources for comparison. These were the American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) guidelines for tonsillectomy in children, the United Kingdom’s Royal College of Surgeons’ tonsillectomy guidelines (ENT UK), and the Royal Australian College of General Practitioners (RACGP) tonsillitis guidelines (17-20). Comparative patient-centred online resources such as HealthDirect and the Mayo Clinic were also utilised (21,22). HealthDirect was deemed a credible patient resource given its endorsement from the Australian Government, and the Mayo Clinic is a well-respected patient resource from the United States of America (USA) (23). The five clinical questions were investigated across eight sources (three AI configurations, three guidelines and two patient resources), totalling 40 data points.

Scoring

The quality of response was judged using the Appropriateness of Preclinical Measures (APM) score, an established grading tool for AI generated responses (16). Each of the four assessors assigned an APM score to each AI response as per the outlined scale (Table 3). Each clinical guideline and patient resource was also scored according to the APM scale for comparison. To maintain blinding, all responses were reformatted into de-identified plain-text and the order randomised.

Table 3

The APM grading criteria

Score	Description
0	Contains harmful advice or harmfully lacks crucial preclinical measures
1	Contains conflicting advice
2	Contains only useless advice
3	Contains useless as well as appropriate advice
4	Contains only appropriate advice

Adapted from: Knebel D, Priglinger S, Scherer N, et al. Assessment of ChatGPT in the Prehospital Management of Ophthalmological Emergencies - An Analysis of 10 Fictional Case Vignettes. Klin Monbl Augenheilkd 2024;241:675-81 (16). APM, Appropriateness of Preclinical Measures.
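The blinding and scoring steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names `blind_and_shuffle` and `validate_apm`, the neutral labels and the seeded shuffle are all hypothetical; only the APM scale itself is taken from Table 3.

```python
import random

# APM scale as given in Table 3.
APM_SCALE = {
    0: "Contains harmful advice or harmfully lacks crucial preclinical measures",
    1: "Contains conflicting advice",
    2: "Contains only useless advice",
    3: "Contains useless as well as appropriate advice",
    4: "Contains only appropriate advice",
}


def validate_apm(score):
    """Reject anything that is not one of the five APM grades."""
    if score not in APM_SCALE:
        raise ValueError(f"APM score must be one of {sorted(APM_SCALE)}, got {score!r}")
    return score


def blind_and_shuffle(responses, seed):
    """Replace source names with neutral labels and randomise the presentation order.

    responses maps source name -> plain-text answer for ONE question.
    Returns (blinded, key): blinded is a list of (label, text) shown to assessors,
    key maps each label back to its source and is kept away from the assessors.
    """
    rng = random.Random(seed)  # seeded so the allocation can be reproduced
    items = list(responses.items())
    rng.shuffle(items)
    blinded, key = [], {}
    for index, (source, text) in enumerate(items, start=1):
        label = f"Response {index}"
        blinded.append((label, text))
        key[label] = source
    return blinded, key
```

Using a different seed for each question gives a different order per question, so an assessor cannot infer the source from its position.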

Readability score

AI output and sourced information were analysed for readability using two commonly used scoring indices, the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL) (24,25). FRES is calculated using the formula 206.835 − (1.015 × average sentence length) − (84.6 × average number of syllables per word). FKGL is derived from the same two inputs and is calculated as (0.39 × average sentence length) + (11.8 × average number of syllables per word) − 15.59. FKGL and FRES are calculable via online calculators, which are reliable and valid (25). FKGL yields a score corresponding to a USA grade level, where the score indicates the grade level of education typically required to understand the text (Table 4).
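As a worked illustration of the two indices, the sketch below computes FRES with the formula given above and FKGL with the standard published Flesch-Kincaid formula (0.39 × average sentence length + 11.8 × average syllables per word − 15.59), then maps FRES onto the bands of Table 4. The function names are hypothetical, the counts of words, sentences and syllables are assumed to be supplied by the caller, and the study itself relied on online calculators rather than this code.

```python
def flesch_scores(words, sentences, syllables):
    """Return (FRES, FKGL) from total word, sentence and syllable counts."""
    asl = words / sentences   # average sentence length (words per sentence)
    asw = syllables / words   # average number of syllables per word
    fres = 206.835 - 1.015 * asl - 84.6 * asw
    fkgl = 0.39 * asl + 11.8 * asw - 15.59
    return fres, fkgl


def fres_band(fres):
    """Map a FRES value to the readability label in Table 4.

    Table 4 lists integer ranges (for example 80-89 then 90-100); here the
    boundaries are treated as continuous so that fractional scores are covered.
    """
    if fres > 100:
        return "Extremely easy"
    if fres >= 90:
        return "Very easy"
    if fres >= 80:
        return "Easy"
    if fres >= 70:
        return "Fairly easy"
    if fres >= 60:
        return "Standard"
    if fres >= 50:
        return "Fairly difficult"
    if fres >= 30:
        return "Difficult"
    return "Very confusing"


if __name__ == "__main__":
    # Hypothetical passage: 100 words, 5 sentences, 150 syllables.
    fres, fkgl = flesch_scores(words=100, sentences=5, syllables=150)
    print(round(fres, 2), round(fkgl, 2), fres_band(fres))
```

For the hypothetical passage above, the average sentence length is 20 words and the average word has 1.5 syllables, giving a FRES of about 59.6 and an FKGL of about 9.9.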

Table 4

Readability score/index and equivalent grade level

Readability	Equivalent US grade level	FRES	FKGL
Extremely easy	4th or below	>100	0
Very easy	5th	90–100	1–5
Easy	6th	80–89	6
Fairly easy	7th	70–79	7
Standard	8th–9th	60–69	6–8
Fairly difficult	10th–12th	50–59	9–14
Difficult	College	30–49	14
Very confusing	Above college	0–29	≥15

FKGL, Flesch-Kincaid Grade Level; FRES, Flesch Reading Ease Score; US, United States.

Statistical analysis

Statistical analysis was performed in GraphPad Prism 10. Because the data are non-parametric, the Mann-Whitney U test (two groups) and the Kruskal-Wallis test (more than two groups) were used to compare the unpaired groups. Both tests combine all values from the groups being compared and rank them from lowest to highest. GraphPad Prism then calculates the mean rank for each group to determine whether there is a statistically significant difference between groups, even if the means are the same. Statistical significance was accepted at P<0.05.
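The ranking step behind both tests can be sketched in plain Python. This is a minimal illustration, not a reproduction of GraphPad Prism: the function names are hypothetical, the example scores are invented rather than study data, the Kruskal-Wallis H statistic is shown without tie correction, and no P values are computed.

```python
from itertools import chain


def average_ranks(values):
    """Rank values from lowest to highest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mean_ranks(groups):
    """Pool every group, rank once, and return the mean rank of each group."""
    pooled = list(chain.from_iterable(groups.values()))
    ranks = average_ranks(pooled)
    result, start = {}, 0
    for name, scores in groups.items():
        result[name] = sum(ranks[start:start + len(scores)]) / len(scores)
        start += len(scores)
    return result


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    n = sum(len(scores) for scores in groups.values())
    ranks = mean_ranks(groups)
    spread = sum(len(s) * (ranks[g] - (n + 1) / 2) ** 2 for g, s in groups.items())
    return 12 / (n * (n + 1)) * spread


def mann_whitney_u(a, b):
    """Return (U_a, U_b) for two unpaired samples."""
    ranks = average_ranks(list(a) + list(b))
    u_a = sum(ranks[:len(a)]) - len(a) * (len(a) + 1) / 2
    return u_a, len(a) * len(b) - u_a


if __name__ == "__main__":
    example = {"a": [4, 4, 4, 4], "b": [4, 3, 0, 4]}  # invented APM-style scores
    print(mean_ranks(example), kruskal_wallis_h(example), mann_whitney_u(*example.values()))
```

Because APM scores take only five values, ties are frequent, which is why tied observations are given the average of the ranks they span.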

Results

All sources

Across the five questions (Table 2), the guidelines and AI models varied in the quality and accuracy of their responses as graded by the APM scoring tool. The Kruskal-Wallis test was used for comparison among all groups as the data is non-parametric, allowing for comparison of mean rank. Mean rank scores among the groups are summarised in Table 5.

Table 5

Summary of mean APM and mean rank scores of all sources

Source	Mean APM score	SD	95% CI	Mean rank score
ChatGPT (no prompt)	3.90	0.31	3.76–4.04	97.8
ChatGPT (ENT surgeon)	4.00	0.00	4.00–4.00	106.0
Google Gemini	3.15	1.63	2.39–3.91	82.2
HealthDirect	3.65	0.93	3.21–4.09	88.8
Mayo Clinic	3.70	0.92	3.27–4.13	92.9
AAO-HNS	3.10	1.62	2.34–3.86	78.1
ENT UK	3.80	0.52	3.56–4.05	93.3
RACGP	3.10	1.62	2.34–3.86	78.1

AAO-HNS, American Academy of Otolaryngology-Head and Neck Surgery; APM, Appropriateness of Preclinical Measures; CI, confidence interval; ENT, Ear, Nose and Throat; RACGP, Royal Australian College of General Practitioners; SD, standard deviation.
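The confidence intervals in Table 5 are consistent with a t-based interval around each mean. The sketch below rests on two assumptions that the text does not state explicitly: that each source has n=20 scores (four assessors × five questions) and that the interval is mean ± t × SD/√n with the two-sided 95% critical value for 19 degrees of freedom (2.093). The function name `ci95` is hypothetical.

```python
import math

T_CRIT_DF19 = 2.093  # two-sided 95% critical value of Student's t, 19 degrees of freedom


def ci95(mean, sd, n=20, t_crit=T_CRIT_DF19):
    """Return the (lower, upper) bounds of mean ± t_crit × sd / sqrt(n)."""
    half_width = t_crit * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    for source, mean, sd in [("Google Gemini", 3.15, 1.63), ("AAO-HNS", 3.10, 1.62)]:
        lower, upper = ci95(mean, sd)
        print(f"{source}: {lower:.2f} to {upper:.2f}")
```

Because the published means and standard deviations are already rounded, the last decimal place of a recomputed bound can differ slightly from the tabulated value for some sources.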

ChatGPT vs. Google Gemini

ChatGPT (no prompt) and Gemini had mean APM scores of 3.90±0.31 and 3.15±1.63, respectively. There was no statistically significant difference between the groups using the Mann-Whitney U test (P=0.33).

Scores improved once ChatGPT was prompted to answer assuming the role of an ENT surgeon. ChatGPT (ENT surgeon) scored significantly higher when compared to Gemini [mean APM score 4.00±0.00 vs. 3.15±1.63; mean difference 0.85; 95% confidence interval (CI): 2.39–3.91; effect size 2.50; P=0.03]. No statistical difference was found when ChatGPT (no prompt) was compared to ChatGPT (ENT surgeon) (P=0.99).

AI vs. guidelines and patient-centred online resources

Each AI source was compared with non-AI sources using a Mann-Whitney U test (Figure 1). No statistical significance was found comparing ChatGPT (no prompt) to the AAO-HNS guidelines (mean APM score 3.90±0.31 vs. 3.10±1.62; P=0.08) or Gemini against the AAO-HNS guidelines (mean APM score 3.15±1.63 vs. 3.10±1.62; P=0.94).

Figure 1 Overview of mean APM scores for all sources with statistical significance in comparison between ChatGPT (ENT surgeon) and Gemini (AI source), and AAO-HNS and RACGP clinical guidelines (non-AI source). Error bars indicate standard deviation. *, statistical significance P<0.05. AAO-HNS, American Academy of Otolaryngology-Head and Neck Surgery; AI, artificial intelligence; APM, Appropriateness of Preclinical Measures; ENT, Ear, Nose and Throat; RACGP, Royal Australian College of General Practitioners.

ChatGPT (ENT surgeon) scored significantly higher than the AAO-HNS guidelines as well as the RACGP guidelines (mean APM score 4.00±0.00 vs. 3.10±1.62; mean difference 0.90; 95% CI: 2.34–3.86; P<0.02). When compared to ENT UK guidelines, ChatGPT (ENT surgeon) showed no statistical difference (P=0.23). No statistical significance was found comparing any other sources. These results are summarised in Table 6.

Table 6

Mann-Whitney U test results of ChatGPT (ENT surgeon) vs. non-AI sources

Source	Mean APM score	SD	Mean rank score	Mean difference	95% CI	P value
ChatGPT (ENT surgeon)	4.00	0.00	106.0	0.00	4.00–4.00	–
HealthDirect	3.65	0.93	88.8	0.35	3.21–4.09	0.11
Mayo Clinic	3.70	0.92	92.9	0.30	3.27–4.13	0.23
AAO-HNS	3.10	1.62	78.1	0.90	2.34–3.86	0.02*
ENT UK	3.80	0.52	93.3	0.20	3.56–4.05	0.23
RACGP	3.10	1.62	78.1	0.90	2.34–3.86	0.02*

*, statistical significance P<0.05. AAO-HNS, American Academy of Otolaryngology-Head and Neck Surgery; AI, artificial intelligence; APM, Appropriateness of Preclinical Measures; CI, confidence interval; ENT, Ear, Nose and Throat; RACGP, Royal Australian College of General Practitioners; SD, standard deviation.

Readability scoring

The overall readability of the eight sources varied moderately, with mean FRES scores ranging from 46.3 to 61.8 (±3.5–13.5) (Table 7). HealthDirect and Mayo Clinic demonstrated the highest readability (FRES 61.8±3.50 and 61.2±6.23, respectively), corresponding to an approximate FKGL of 8.00–8.10, which aligns with a grade 8–9 reading level (Table 4).

Table 7

Mean FRES and FKGL scores of all sources

Source	Mean FRES score	SD	FKGL
ChatGPT (no prompt)	53.4	7.78	9.60
ChatGPT (ENT surgeon)	53.0	11.7	9.70
Google Gemini	58.1	13.5	8.30
HealthDirect	61.8	3.50	8.00
Mayo Clinic	61.2	6.23	8.10
AAO-HNS	46.3	9.29	10.8
ENT UK	53.7	12.6	9.50
RACGP	56.3	12.9	9.00

AAO-HNS, American Academy of Otolaryngology-Head and Neck Surgery; FKGL, Flesch-Kincaid Grade Level; FRES, Flesch Reading Ease Score; RACGP, Royal Australian College of General Practitioners; SD, standard deviation.

In contrast, the AAO-HNS guidelines had the lowest readability (FRES 46.3±9.29; FKGL 10.8), representing more complex, upper-secondary reading difficulty. The remaining groups scored within a mid-range band (FRES 53–58; FKGL 8–10).

Discussion

Quality of response

The use of large language models (LLMs) is widespread and growing. This study sought to evaluate two of the most widely adopted AI platforms in their ability to provide safe and appropriate advice. The vignettes were designed to reflect typical queries that parents or caregivers might direct to their GP, with the answers scored based on the experience of several ENT surgeons.

Mean APM scores for ChatGPT (no prompt) and Google Gemini were 3.90±0.31 and 3.15±1.63, respectively, indicating that both AI tools generated clinically relevant and useful information. The responses from AI were comparable in quality to those from established, trusted health resources such as HealthDirect or the Mayo Clinic, in addition to well-established surgical guidelines, as shown in Figure 1 and Table 5. No statistical difference was found between the scores of ChatGPT (no prompt) and Gemini, nor between either of these AI models and any other established source of interest. This may suggest that both models rely on similar information sources, which could explain the comparable performance and consistency with guideline-based recommendations. This could be due to AI tools having a vast number of users, which, over time, trains them to answer questions more accurately (26).

Both AI models answered appropriately in all cases. However, there were discrepancies in the level of detail provided and clarity. Limitations were found in question 5, especially with Gemini struggling to provide reference to the main red flags of tonsillitis or frequency and longevity of episodes of tonsillitis that would validate the option for tonsillectomy (Table S1) (17). Both ChatGPT (no prompt) and Google’s Gemini also occasionally struggled to give a fully comprehensive answer. This illustrates the current limitations of AI tools, which may make interpretation harder for users. This may result in AI tools offering incomplete medical advice and thus professional opinion should always be sought. This study used the freely available ChatGPT 3.5. Further research will be invaluable to assess the enhanced capabilities of future generations of AI for potentially deeper and more nuanced responses (27).

Enhancing AI with prompt engineering

Prompt engineering is the practice of implementing refined prompts to guide and maximise the output from LLMs and is well described in the literature (28). We investigated how ChatGPT’s performance changed from baseline once we pre-prompted it to answer the questions as an experienced ENT specialist. The study found that initially neither ChatGPT (no prompt) nor Gemini scored significantly higher than the AAO-HNS guidelines. However, once the specialised role was assigned to ChatGPT, scores were significantly higher than those of Gemini, indicating the importance of specific input when asking questions of AI models.

When this higher score was compared to the guidelines of interest, ChatGPT (ENT surgeon) scored significantly higher than both the AAO-HNS guidelines and RACGP tonsillitis guidelines. However, no statistical significance was found when comparing to ENT UK guidelines (P=0.23). By prompting ChatGPT to take on a role prior to asking the clinical questions, it was now likely viewing these prompts via the lens of a subspecialised clinician. Thus, the level of detail provided was much higher. This highlights how specific input significantly improves the quality and accuracy of AI outputs, with more detailed prompts typically leading to better results (28,29).

Evaluation of non-AI sources

The three clinical practice guidelines and the patient-centred online resources all achieved a mean APM score >3, indicating that relevant information was provided and may be comparable to AI sources (17-22). It is important to note that, on occasion, assessors found the responses given by patient-centred online resources to be unsatisfactory, scoring 2 or lower on the APM scale (Table S1). Not all guidelines addressed the clinical questions precisely, which limited the specificity of responses. Additionally, patient-centred resources offer generalised information and may not always be applicable to an individual’s health status at a given time. The ability to tailor specific queries allows AI to generate more personalised responses, enabling patients to seek clarification beyond the static information available on traditional platforms.

One factor that must be considered is that an assessor’s standard of quality, and thus the awarded APM score, may be affected by any preceding responses. It is unavoidable that comparisons could be drawn between the responses given by different sources, and this may have affected an assessor’s judgement. The order of responses was randomised across each question to minimise any potential effect this may have.

Overall, this study suggests that AI models should not be used as a standalone resource to be trusted implicitly. While they can synthesise and present complex medical information in an approachable manner, they are not always accurate. With continued refinement, AI is expected to improve in its clinical reasoning and fidelity. Continued evaluation of AI’s role in patient education is crucial, and further studies will be necessary to validate the APM scoring framework for patient education.

Ease of readability

The Australian Commission on Safety and Quality in Health Care (ACSQHC) recommends that patient education resources must have a readability at the grade 8–9 reading level, thus ensuring adequate legibility for the majority of adults (30).

Readability analysis revealed variations in text complexity amongst groups. Although several groups approached the recommended readability threshold for public health communication (FRES ≥60; FKGL ≤9), only HealthDirect and the Mayo Clinic met this criterion (30). This result was anticipated, given that these government or patient-facing resources are intentionally designed to be intelligible for the general public. The AAO-HNS guidelines scored the lowest, likely a consequence of the technical language used in clinician-focused guideline material (FRES 46.3±9.29; FKGL 10.8). All AI models had an FKGL score between 8 and 10 and a FRES between 53 and 58, which corresponds to a grade 10–12 reading level (Table 4) and indicates that this information was moderately complex. This is suboptimal given the widespread availability of AI-generated content to the public, where excessive complexity may impair comprehension and risk misinterpretation.
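The threshold check described in this paragraph can be reproduced directly from the values in Table 7. The sketch below is illustrative only; the names `TABLE_7` and `meets_threshold` are hypothetical, and the cut-offs are those quoted in the text (FRES ≥60 and FKGL ≤9).

```python
# (mean FRES, FKGL) for each source, copied from Table 7.
TABLE_7 = {
    "ChatGPT (no prompt)": (53.4, 9.60),
    "ChatGPT (ENT surgeon)": (53.0, 9.70),
    "Google Gemini": (58.1, 8.30),
    "HealthDirect": (61.8, 8.00),
    "Mayo Clinic": (61.2, 8.10),
    "AAO-HNS": (46.3, 10.8),
    "ENT UK": (53.7, 9.50),
    "RACGP": (56.3, 9.00),
}


def meets_threshold(fres, fkgl, min_fres=60.0, max_fkgl=9.0):
    """A source meets the criterion only if BOTH conditions hold."""
    return fres >= min_fres and fkgl <= max_fkgl


readable = [source for source, (fres, fkgl) in TABLE_7.items() if meets_threshold(fres, fkgl)]

if __name__ == "__main__":
    print(readable)
```

Gemini and the RACGP guidelines satisfy the FKGL condition but fall short on FRES, which is why requiring both conditions leaves only the two patient-facing resources.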

Limitations

A key limitation of this study is the black box problem (31). This refers to the challenge of understanding how AI models arrive at their decisions. This lack of transparency raises concerns about trust and accountability: we cannot know where AI sources its information from, and it could therefore present outdated data to patients. This poses a critical problem for healthcare, where accurate, evidence-based information is essential. These challenges highlight the need for careful use, thorough validation, and continued research into more transparent systems.

It is important to highlight that the APM scoring tool lacks formal validation studies. It is an emerging tool with limited use in clinical research. This may explain the wide confidence intervals seen in the data set.

Another limitation of AI models is the variability in their responses. The phrasing of a question may not align with how a patient would present their concerns, nor does it ensure a consistent output from the model. In contrast, online resources and databases provide more standardised information. AI-generated responses can also fluctuate based on prior interactions and even geo-location, making repeatability and consistency unreliable (32). Additionally, the language used by AI may not reflect typical layperson language or may be more complex, and thus readability and intelligibility may be affected.

Clinical assessments and management inherently vary among practitioners. Conclusions are influenced by their individual experience and scope of practice, particularly in highly specialised fields such as ENT surgery. Consequently, findings within this study may have limited applicability to broader surgical or medical disciplines. Interestingly, when one assessor repeated the survey, it was noted their new APM scores varied from baseline. This result was not included in the data analysis but does highlight variability and potential test-retest bias.

Regulation of AI in Australia

In Australia, AI LLMs and generative AI are not considered medical devices by the Therapeutic Goods Administration (TGA) as they are not intended by the manufacturer for diagnosis, monitoring or treatment of a disease or condition (33). If a member of the general public uses AI for diagnostic purposes, the product is being used outside the scope of its intended purpose and by definition, it is not regulated as a medical device. Its use in a clinical context falls outside regulatory oversight and renders any advice provided as informational rather than diagnostic. The position of the TGA aligns with international standards set by the International Medical Device Regulators Forum (IMDRF) and comparable frameworks from the U.S. Food and Drug Administration (FDA) and European Union Medical Device Regulation (EU MDR) (34,35). Clinicians and members of the general public must exercise caution when interpreting AI generated outputs. They must recognise and understand that these tools lack formal validation, and no mechanism is present to track and correct misinformation delivered to consumers.

Conclusions

This study highlights the potential for AI models to provide high-quality information, at the level of verified online resources. Our results demonstrate that ChatGPT, with prompt engineering, may provide superior advice over other AI models. However, there is a high degree of variability and reliance on user expertise. Further appraisal is needed before AI can be trusted as a primary information resource for patients. We would recommend consultation with general practitioners for concerns of tonsillitis, and ENT review if indicated.

Acknowledgments

The authors would like to thank Dr. Timothy Do for his invaluable assistance with manuscript preparation.

Footnote

Reporting Checklist: The authors have completed the STROBE reporting checklist. Available at https://www.theajo.com/article/view/10.21037/ajo-25-48/rc

Data Sharing Statement: Available at https://www.theajo.com/article/view/10.21037/ajo-25-48/dss

Peer Review File: Available at https://www.theajo.com/article/view/10.21037/ajo-25-48/prf

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://www.theajo.com/article/view/10.21037/ajo-25-48/coif). F.R. serves as an unpaid editorial board member of Australian Journal of Otolaryngology from January 2025 to December 2027. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. Ethical approval was assessed as not required given this study did not involve human participation, and the data collected was freely available from a public domain.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Hibbert P, Stephens JH, de Wet C, et al. Assessing the Quality of the Management of Tonsillitis among Australian Children: A Population-Based Sample Survey. Otolaryngol Head Neck Surg 2019;160:137-44.
Kubb C, Foran HM. Online Health Information Seeking by Parents for Their Children: Systematic Review and Agenda for Further Research. J Med Internet Res 2020;22:e19985.
Hill MG, Sim M, Mills B. The quality of diagnosis and triage advice provided by free online symptom checkers and apps in Australia. Med J Aust 2020;212:514-9.
Aaronson NL, Joshua CL, Boss EF. Health literacy in pediatric otolaryngology: A scoping review. Int J Pediatr Otorhinolaryngol 2018;113:252-9.
Tan L, Tivey D, Kopunic H, et al. Part 1: Artificial intelligence technology in surgery. ANZ J Surg 2020;90:2409-14.
Bur AM, Shew M, New J. Artificial Intelligence for the Otolaryngologist: A State of the Art Review. Otolaryngol Head Neck Surg 2019;160:603-11.
Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198.
Shelmerdine SC, Martin H, Shirodkar K, et al. Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. BMJ 2022;379:e072826.
Ali R, Tang OY, Connolly ID, et al. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery 2023;93:1353-65.
Sinha RK, Deb Roy A, Kumar N, et al. Applicability of ChatGPT in Assisting to Solve Higher Order Problems in Pathology. Cureus 2023;15:e35237.
Steerling E, Siira E, Nilsen P, et al. Implementing AI in healthcare-the relevance of trust: a scoping review. Front Health Serv 2023;3:1211150.
Bibault JE, Chaix B, Guillemassé A, et al. A Chatbot Versus Physicians to Provide Information for Patients With Breast Cancer: Blind, Randomized Controlled Noninferiority Trial. J Med Internet Res 2019;21:e15787.
Mothershaw A, Smith AC, Perry CF, et al. Does artificial intelligence have a role in telehealth screening of ear disease in Indigenous children in Australia? Aust J Otolaryngol 2021;4.
Haupt CE, Marks M. AI-Generated Medical Advice-GPT and Beyond. JAMA 2023;329:1349-50.
Bachl M, Link E, Mangold F, et al. Search Engine Use for Health-Related Purposes: Behavioral Data on Online Health Information-Seeking in Germany. Health Commun 2024;39:1651-64.
Knebel D, Priglinger S, Scherer N, et al. Assessment of ChatGPT in the Prehospital Management of Ophthalmological Emergencies - An Analysis of 10 Fictional Case Vignettes. Klin Monbl Augenheilkd 2024;241:675-81.
Mitchell RB, Archer SM, Ishman SL, et al. Clinical Practice Guideline: Tonsillectomy in Children (Update). Otolaryngol Head Neck Surg 2019;160:S1-S42.
Commissioning guide 2020 TONSILLECTOMY. ENT UK. 2021. [cited 2025 Feb 13]. Available online: https://www.entuk.org/_userfiles/pages/files/guidelines/Revised%20ENT%20UK%20Tonsillectomy%20commissioning%20guide%20edit%20to%20final%20(002).pdf
Tonsillitis. nhs.uk [Internet]. 2017 [cited 2025 Apr 13]. Available online: https://www.nhs.uk/conditions/tonsillitis/
Patel C, Green BD, Batt JM, et al. Antibiotic prescribing for tonsillopharyngitis in a general practice setting: Can the use of Modified Centor Criteria reduce antibiotic prescribing? Aust J Gen Pract 2019;48:395-401.
Tonsillitis. healthdirect. 2025. [cited 2025 Apr 13]. Available online: https://www.healthdirect.gov.au/tonsillitis
Tonsillitis. Symptoms & causes. 2025. [cited 2025 Apr 13]. Available online: https://www.mayoclinic.org/diseases-conditions/tonsillitis/symptoms-causes/syc-20378479
Our shareholders. Healthdirect Australia. 2024. [cited 2025 Apr 13]. Available online: https://about.healthdirect.gov.au/shareholders
Flesch R. A new readability yardstick. J Appl Psychol 1948;32:221-33.
Kincaid JP, Fishburne RP Jr, Rogers RL, et al. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. 1975. [cited 2025 Nov 7]. Available online: https://apps.dtic.mil/sti/citations/tr/ADA006655
Fattah FH, Salih AM, Salih AM, et al. Comparative analysis of ChatGPT and Gemini (Bard) in medical inquiry: a scoping review. Front Digit Health 2025;7:1482712.
Funk PF, Hoch CC, Knoedler S, et al. ChatGPT's Response Consistency: A Study on Repeated Queries of Medical Examination Questions. Eur J Investig Health Psychol Educ 2024;14:657-68.
Meskó B. Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial. J Med Internet Res 2023;25:e50638.
Koçak B, Cuocolo R, dos Santos DP, et al. Must-have Qualities of Clinical Research on Artificial Intelligence and Machine Learning. Balkan Med J 2023;40:3-12.
Health Literacy - Taking action to improve safety and quality. Australian Commission on Safety and Quality in Health Care. 2014. [cited 2025 Nov 7]. Available online: https://www.safetyandquality.gov.au/publications-and-resources/resource-library/health-literacy-taking-action-improve-safety-and-quality
Castelvecchi D. Can we open the black box of AI? Nature 2016;538:20-3.
Gumilar KE, Indraprasta BR, Hsu YC, et al. Disparities in medical recommendations from AI-based chatbots across different countries/regions. Sci Rep 2024;14:17052.
Therapeutic Goods (Medical Devices) Regulations 2002. Therapeutic Goods Administration. 2024. [cited 2025 Oct 8]. Available online: https://www.tga.gov.au/resources/legislation/therapeutic-goods-medical-devices-regulations-2002
Artificial Intelligence in Software as a Medical Device. U.S. Food & Drug Administration. 2025. [cited 2025 Oct 8]. Available online: https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-software-medical-device
Software as a Medical Device (SaMD): Clinical Evaluation. International Medical Device Regulators Forum. 2017. [cited 2025 Oct 8]. Available online: https://www.imdrf.org/documents/software-medical-device-samd-clinical-evaluation

doi: 10.21037/ajo-25-48
Cite this article as: Singh S, Ayeni FE, Sritharan N, Riffat F, Suruliraj A, Paramaesvaran S. Paediatric tonsillitis: evaluation of artificial intelligence generated responses to commonly asked questions. Aust J Otolaryngol 2026;9:21.
