AI Insights

Artificial intelligence is revolutionising medical image analysis

Published

4 weeks ago

August 5, 2025

The Editors

By
Naomi Stekelenburg

6 August 2025
4 min read

Key points

AI is now a prominent feature of the healthcare landscape.

One type of AI, called a visual language model, is being used to “read” X-rays and generate reports.

The technology will not replace human analysis but provide a tool to support radiologists.

One in two Australians regularly use artificial intelligence (AI), with that number expected to grow. AI is showing up in our lives more prominently than ever, with the arrival of ChatGPT and other chatbots.

Researchers at CSIRO’s Australian e-Health Research Centre (AEHRC) are exploring how AI – including the systems that underpin chatbots – can be leveraged for a more altruistic endeavour: to revolutionise healthcare.

Earlier versions of ChatGPT were built on an AI system called a large language model (LLM) and were entirely text-based. You would ‘talk’ to it by entering text.

The latest version of ChatGPT, for instance, incorporates a visual language model (VLM), which adds visual understanding on top of the LLM’s language skills. This allows it to ‘see’, describe what it ‘sees’ and connect it to language.

AEHRC researchers are now using VLMs to help interpret medical images such as X-rays.

It’s complicated technology, but the aim is straightforward: to support radiologists and reduce the burden on them.

This work enables automated reporting of X-rays

Visual language models are transforming X-ray analysis

Dr Aaron Nicolson, Research Scientist at AEHRC, is one of the researchers working on the project.

He said any kind of image can be used with VLMs, but his team is focusing on chest X-rays.

Chest X-rays are used for many important reasons, including to diagnose heart and respiratory conditions, screen for lung cancers and to check the positioning of medical devices such as pacemakers.

Typically, trained specialists – radiologists – are required to interpret the complex images and produce a diagnostic report.

But in Australia, radiologists are overburdened.

“There are too few radiologists for the mountain of work that needs to be completed,” Aaron said.

The problem will likely get worse with the number of patients and chest X-rays taken set to keep increasing, especially as the population ages.

That’s why Aaron is developing a model that uses a VLM to produce radiology reports from chest X-rays.

“The goal is to create technology that can integrate into radiologists’ workflow and provide assistance,” he said.

Image: Aaron Nicolson working on his model for automated X-ray reporting.

Practice makes (almost) perfect

Training the VLM involves lots of data. The more information a model has, the better it can make predictions.

The VLM is given the same information that a radiologist would receive – X-ray images and the patient’s referral, Aaron explained.

“Then we give the model the matching radiology report written by a radiologist. The model learns to produce a report based on the images and information it is given,” he said.

Like humans, AI models improve by practising.

“We train the model using hundreds and thousands of X-rays. As the model trains on more data, it can produce more accurate reports,” said Aaron.

At this stage of his research, Aaron was looking to improve the accuracy of the reports even further – so he decided to try something new.

“We gave the model the patient’s records from the emergency department as well,” he said.

“That means information like the patient’s chief complaint when triaged, their vital signs over the course of the stay, the medications they usually take and the medications administered during the patient’s stay.”

Just as he had hoped, giving the model this extra information improved the accuracy of the radiology reports.

“We are trying to get the technology to a point where it can be considered for prospective trials. This is a big step in that direction,” he said.

Image: Workflow of the multimodal large language model.

Ethical and applicable AI

As well as generating diagnostic reports from chest X-ray images, AEHRC is exploring other applications of VLMs.

Dr Arvin Zhuang, a postdoctoral researcher at AEHRC, is using VLMs to retrieve information from images of medical documents. Processing the documents as images rather than text enables the information to be retrieved more efficiently.

It’s an exciting time for Aaron and Arvin, but ethical and safety considerations are always at the front of their minds.

“We want to make sure that the model is effective for all populations. To do that, we have to consider and manage issues like demographic biases in the data we train our models on,” Aaron said.

He also notes that the technology is not designed to replace medical specialists.

“The technology will not be making clinical decisions by itself. There will always be a radiologist in the loop,” Aaron said.

Aaron and his team are currently conducting a trial of the technology in collaboration with the Princess Alexandra Hospital in Brisbane, assessing how the AI-generated reports compare with those produced by human radiologists.

They are also actively seeking additional clinical sites to participate in further trials, to evaluate the technology’s effectiveness across a broader range of settings.


AI Insights

Asia Fund Beating 95% of Peers Is Bullish on Chip Gear Makers

Published

1 hour ago

September 2, 2025

The Editors

Chinese chipmakers are trading at a four-year high versus their US peers, but a top fund manager still sees pockets of opportunity among their equipment suppliers.


AI Insights

Deep computer vision with artificial intelligence based sign language recognition to assist hearing and speech-impaired individuals

Published

1 hour ago

September 1, 2025

The Editors

This study proposes a novel HHODLM-SLR technique, which focuses on the automatic detection and classification of sign language (SL) for people with hearing and speech impairments. The technique comprises four stages: bilateral filtering (BF)-based image pre-processing, ResNet-152-based feature extraction, bidirectional long short-term memory (BiLSTM)-based sign language recognition (SLR), and Harris hawks optimization (HHO)-based hyperparameter tuning. Figure 1 represents the workflow of the HHODLM-SLR model.

Fig. 1

Workflow of HHODLM-SLR model.

Image pre-processing

Initially, the HHODLM-SLR approach uses BF to remove noise from the input image dataset³⁸. BF is chosen for its dual capability to reduce noise while preserving critical edge details, which is crucial for precisely interpreting complex hand gestures. Unlike conventional filters, such as Gaussian or median filtering, which may blur crucial features, BF maintains edge sharpness by weighting pixels on both spatial distance and intensity difference. This ensures that the key contours of hand shapes are retained, which supports better feature extraction downstream. Its nonlinear, content-aware nature makes it particularly effective for the complex visual patterns in sign language datasets. Furthermore, BF operates efficiently and adapts to varying lighting or background conditions. These merits make it preferable to conventional pre-processing techniques in this application. Figure 2 represents the working flow of the BF model.

BF is a nonlinear image processing method that preserves edges while reducing noise, which makes it effective for pre-processing in SLR methods. It smooths the image by averaging pixel intensities weighted by both spatial proximity and intensity similarity, so that the edge details essential for recognizing hand movements and shapes remain intact. This is particularly valuable in SLR, where fine edge features and hand gestures are necessary for precise interpretation. Applying BF reduces noise from environmental conditions, such as background clutter or lighting variations, and improves the clarity of the input image. This pre-processing stage improves the performance of feature extraction and of the subsequent detection phases in deep learning (DL) methods.
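The weighting described above can be sketched in a few lines of plain Python. This is a minimal brute-force illustration, not the paper's implementation: the function name `bilateral_filter`, the Gaussian form of both weights, and the default values of `radius`, `sigma_space`, and `sigma_range` are assumptions chosen for readability.

```python
import math


def bilateral_filter(image, radius=1, sigma_space=1.0, sigma_range=25.0):
    """Brute-force bilateral filter for a 2D grayscale image (list of rows).

    All parameter names and defaults are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            center = image[y][x]
            weighted_sum = 0.0
            weight_total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue  # ignore neighbours outside the image
                    neighbour = image[ny][nx]
                    # Spatial weight: nearby pixels count more.
                    spatial = math.exp(-(dx * dx + dy * dy) / (2 * sigma_space ** 2))
                    # Range weight: pixels of similar intensity count more,
                    # which is what keeps edges from being blurred.
                    tonal = math.exp(-((neighbour - center) ** 2) / (2 * sigma_range ** 2))
                    weight = spatial * tonal
                    weighted_sum += weight * neighbour
                    weight_total += weight
            output[y][x] = weighted_sum / weight_total
    return output
```

Because the range weight collapses towards zero across a large intensity jump, pixels on opposite sides of a hand contour barely influence each other, while small fluctuations inside a flat region are averaged away.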

Feature extraction using ResNet-152 model

The HHODLM-SLR technique uses the ResNet152 model for feature extraction³⁹. This model is selected for its deep architecture and its ability to handle vanishing gradients through residual connections. Compared with standard deep networks or CNNs, it captures more complex and abstract features, which are important for distinguishing subtle differences between hand gestures. Its 152-layer depth allows it to learn rich hierarchical representations, enhancing recognition accuracy. The skip connections in ResNet improve gradient flow and make training more stable. Furthermore, it has proven effective across diverse vision tasks, making it a reliable backbone for SL recognition. This combination of depth, performance, and robustness sets it apart from other feature extractors. Figure 3 illustrates the flow of the ResNet152 technique.

The well-known deep residual network ResNet152 is used as the pre-trained deep convolutional neural network (DCNN) in this classification method. Its residual design addresses the vanishing gradient problem. The ResNet152 output is then passed to the SoftMax classifier (SMC) for classification. The following part covers how features are categorized and identified. The convolution layer (CL), downsampling layer (DSL), and fully connected layer (FCL) are the most common layers that make up a DCNN. Network depth plays an essential role in achieving better classification results in DL methods. However, once a CNN is made deeper beyond a certain point, its accuracy first saturates and then degrades. ResNet152 adds the mapping function in Eq. (1) to reduce the influence of this degradation problem.

$$W(x)=K(x)+x$$

(1)

Here, $W(x)$ denotes the mapping function built from a feedforward neural network (NN) together with a shortcut connection (SC). In general, the SC is an identity map that bypasses the stacked layers directly, and $K(x, G_i)$ denotes the residual mapping function. The formulation is given by Eq. (2).

$$Z=K(x, G_i)+x$$

(2)
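The identity shortcut in Eq. (1) and Eq. (2) can be illustrated with a small sketch. The helper names `residual_block`, `scaled_relu`, and `zero_mapping` are hypothetical stand-ins: in ResNet152 the residual function $K$ is a stack of learned convolutional layers, which is not reproduced here.

```python
def residual_block(x, residual_fn):
    """Return K(x) + x, where residual_fn plays the role of K."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("an identity shortcut needs matching dimensions")
    return [f + xi for f, xi in zip(fx, x)]


def scaled_relu(x, weight=0.1):
    # Hypothetical stand-in for the learned layers inside the block.
    return [max(0.0, weight * xi) for xi in x]


def zero_mapping(x):
    # If the learned residual is zero, the block reduces to the identity.
    return [0.0] * len(x)
```

The second helper shows why the shortcut helps against degradation: a block whose residual function outputs zero simply passes its input through, so adding more blocks need not make the network worse than a shallower one.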

In the CLs of the ResNet model, $3\times 3$ filters are applied, and downsampling is performed with a stride of 2. Shortcut connections are then added to build the ResNet. An adaptive function, presented in Eq. (3), is applied to improve the dropout implementation.

$$u=\frac{1}{n}\sum_{i=1}^{n}\left[z\log\left(S_i\right)+\left(1-z\right)\log\left(1-S_i\right)\right]$$

(3)

Here, $n$ denotes the number of training samples, $u$ is the loss function, and $S_i$ represents the SMC output. The SMC is a generalization of logistic regression (LR) that can be applied to multiple class labels. The SMC output is given by Eq. (4).

$$S_k=\frac{e^{l_k}}{\sum_{j=1}^{m}e^{l_j}},\quad k=1,\cdots,m,\quad l=\left(l_1,\cdots,l_m\right)$$

(4)

Here, $S_k$ is the output of the softmax layer, $l_k$ denotes the $k$-th component of the input vector $l$, and $m$ refers to the total number of neurons in the output layer. The presented model uses 152 10 adaptive dropout layers (ADLs), an SMC, and convolutional layers (CLs).
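A short sketch may help make the softmax output and the loss concrete. It assumes the usual cross-entropy convention with a leading minus sign, so that the loss is non-negative and decreases as predictions improve; that sign is an assumption relative to Eq. (3) as printed. The function names and the clipping constant `eps` are illustrative.

```python
import math


def softmax(logits):
    """Map a list of real scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_binary_cross_entropy(targets, probabilities, eps=1e-12):
    """Average cross-entropy over samples, with targets in {0, 1}.

    The leading minus sign is the standard convention and an assumption here.
    """
    if len(targets) != len(probabilities) or not targets:
        raise ValueError("targets and probabilities must be non-empty and equal in length")
    total = 0.0
    for z, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += z * math.log(p) + (1.0 - z) * math.log(1.0 - p)
    return -total / len(targets)
```

For instance, `softmax([1.0, 2.0, 3.0])` assigns the largest probability to the third class, and a confident correct prediction yields a loss close to zero.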

SLR using Bi-LSTM technique

The HHODLM-SLR methodology employs the Bi-LSTM model to perform the SLR process⁴⁰. This model is chosen because it can capture long-term dependencies in both the forward and backward directions within gesture sequences. Unlike unidirectional LSTM or conventional RNNs, Bi-LSTM considers past and future context at the same time, which is important for precisely interpreting the temporal flow of dynamic signs. This bidirectional learning enhances the model’s understanding of gesture transitions and co-articulation effects. Its memory mechanism effectively handles variable-length input sequences, which are common in real-world SLR scenarios. Bi-LSTM outperforms static classifiers such as CNNs or SVMs on sequential data, making it well suited to recognizing time-based gestures. Figure 4 specifies the Bi-LSTM method.

The presented DAE-based approach for feature extraction is defined here, and Bi-LSTM is applied to classify the data. The classification model is a form of supervised learning, and the Bi-LSTM classifier is used to assess how the proposed architecture improves classification performance.

A recurrent neural network (RNN) feeds its output at one time step back into itself at the next time step, which lets it model the temporal structure of the data. RNNs are widely used in DL, but they suffer from vanishing and exploding gradients. The memory unit in the LSTM, by contrast, can choose which data to keep in memory and when to delete it. LSTM can therefore mitigate these training difficulties and mine time series with both short and relatively long intervals between relevant events.

A standard LSTM architecture has three layers: input, recurrent hidden layer (HL), and output. Unlike a traditional RNN, whose HL contains plain neuron nodes, the recurrent HL of the LSTM uses memory units as its basic module. Each memory unit contains three adaptive multiplicative gates: the forget, input, and output gates. Every LSTM node performs the following computation. The input gate at time $t$ is determined by the previous output $h_{t-1}$ of the unit and the current input $x_t$, as specified in Eq. (5); it decides whether the current information is used to update the cell.

$$i_t=\sigma\left(W_i\cdot\left[h_{t-1},x_t\right]+b_i\right)$$

(5)

The forget gate determines whether to keep or discard information, based on the previous HL output and the current input, as specified in Eq. (6).

$$f_t=\sigma\left(W_f\cdot\left[h_{t-1},x_t\right]+b_f\right)$$

(6)

The candidate memory cell value $\overline{C}_t$ is determined by the previous output $h_{t-1}$ of the LSTM cell and the current input $x_t$. The symbol $\bullet$ denotes element-wise multiplication. The memory cell state $C_t$ combines the previous state $C_{t-1}$ and the candidate value $\overline{C}_t$, weighted by the forget and input gates respectively. These values are given in Eq. (7) and Eq. (8).

$$\overline{C}_t=\tanh\left(W_C\cdot\left[h_{t-1},x_t\right]+b_C\right)$$

(7)

$$C_t=f_t\bullet C_{t-1}+i_t\bullet\overline{C}_t$$

(8)

The output gate $o_t$ is defined in Eq. (9) and controls how much of the cell state is exposed. The final output of the cell is $h_t$, given by Eq. (10).

$$o_t=\sigma\left(W_o\cdot\left[h_{t-1},x_t\right]+b_o\right)$$

(9)

$$h_t=o_t\bullet\tanh\left(C_t\right)$$

(10)
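The gate computations in Eq. (5) to Eq. (10) can be traced in a single time step. The sketch below uses plain Python lists and is only an illustration: the function names, the `params` dictionary layout, and the key names (`W_i`, `b_i`, and so on) are assumptions, and a practical model would use a tensor library with trained weights.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def affine(weights, bias, vector):
    """Compute W . vector + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params is a hypothetical dict of weights and biases."""
    concat = h_prev + x_t  # the concatenation [h_{t-1}, x_t]
    i_t = [sigmoid(v) for v in affine(params["W_i"], params["b_i"], concat)]    # Eq. (5)
    f_t = [sigmoid(v) for v in affine(params["W_f"], params["b_f"], concat)]    # Eq. (6)
    c_bar = [math.tanh(v) for v in affine(params["W_C"], params["b_C"], concat)]  # Eq. (7)
    c_t = [f * c + i * cb for f, c, i, cb in zip(f_t, c_prev, i_t, c_bar)]      # Eq. (8)
    o_t = [sigmoid(v) for v in affine(params["W_o"], params["b_o"], concat)]    # Eq. (9)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                          # Eq. (10)
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate value to 0, so the cell state is simply halved at each step. This makes the role of the forget gate easy to see in isolation.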

The BiLSTM consists of a forward and a backward LSTM network. Both hidden layers extract features: the forward layer processes the sequence in the forward direction, and the backward layer processes it in reverse. The Bi-LSTM approach thus accounts for the features both before and after each point in the sequence, which yields more comprehensive feature information. The current state of the Bi-LSTM comprises both the forward and backward outputs, as specified in Eq. (11), Eq. (12), and Eq. (13).

$$h_t^{forward}=LSTM^{forward}\left(h_{t-1},x_t,C_{t-1}\right)$$

(11)

$$h_t^{backward}=LSTM^{backward}\left(h_{t-1},x_t,C_{t-1}\right)$$

(12)

$$H_t=\left[h_t^{forward},h_t^{backward}\right]$$

(13)
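The pairing of forward and backward outputs in Eq. (13) can be shown independently of the LSTM internals. In this sketch the helper `run_bidirectional` and its step-function arguments are hypothetical; any recurrent update of the form `new_state = step(state, x)` can be plugged in, and the backward pass is assumed to read the sequence from its last element to its first.

```python
def run_bidirectional(sequence, forward_step, backward_step, initial_state):
    """Run two recurrent passes and pair their states at every time step."""
    forward_states = []
    state = initial_state
    for x in sequence:  # left-to-right pass
        state = forward_step(state, x)
        forward_states.append(state)

    backward_states = []
    state = initial_state
    for x in reversed(sequence):  # right-to-left pass
        state = backward_step(state, x)
        backward_states.append(state)
    backward_states.reverse()  # realign with the original time order

    # Each entry combines the past context (forward) with the future context (backward).
    return list(zip(forward_states, backward_states))


def running_sum(state, x):
    # Toy recurrent update used only to make the data flow visible.
    return state + x
```

With the toy update, `run_bidirectional([1, 2, 3], running_sum, running_sum, 0)` gives, at each position, the sum of everything up to and including that element paired with the sum of everything from that element onward.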

Hyperparameter tuning using the HHO model

The HHODLM-SLR methodology uses the HHO model for hyperparameter tuning⁴¹. This model is employed for its robust global search capability and its adaptive behaviour, inspired by the cooperative hunting strategy of Harris hawks. Unlike grid or random search, which can be time-consuming and inefficient, HHO dynamically balances exploration and exploitation to find good hyperparameter values. It is designed to escape local minima and accelerate convergence, enhancing the performance and stability of the model. Compared with other metaheuristics, such as PSO or GA, HHO offers faster convergence and fewer tunable parameters. Its bio-inspired nature makes it appropriate for complex, high-dimensional optimization tasks in DL models. Figure 5 depicts the flow of the HHO methodology.

The HHO model is a bio-inspired technique based on the behaviour of Harris hawks. It is modelled through exploration and exploitation phases. In the exploration phase, the hawks track and detect prey with their keen eyes. They perch at random positions and wait to identify prey. Assuming an equal chance $q$ for each perching strategy, a hawk either perches based on the positions of the other family members and the prey when $q<0.5$, or lands at a random position in the trees when $q\ge 0.5$, as given by Eq. (14).

$$X(t+1)=\begin{cases}X_{rnd}(t)-r_1\left|X_{rnd}(t)-2r_2X(t)\right|, & q\ge 0.5\\ \left(X_{rabb}(t)-X_m(t)\right)-r_3\left(LB+r_4\left(UB-LB\right)\right), & q<0.5\end{cases}$$

(14)

The average position $X_m(t)$ of the $N$ hawks is computed by Eq. (15).

$$X_m(t)=\frac{1}{N}\sum_{i=1}^{N}X_i(t)$$

(15)

The transition from exploration to exploitation is governed by the energy the prey loses while escaping, as given by Eq. (16).

$$E=2E_0\left(1-\frac{t}{T}\right)$$

(16)

The parameter $E$ signifies the prey’s escape energy, $t$ is the current iteration, and $T$ represents the maximum number of iterations. $E_0$ denotes a random initial energy that varies within $(-1,1)$ at every iteration.
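The decay in Eq. (16) is easy to check numerically. The sketch below is illustrative only: the function name `escape_energy` and the optional `e0` argument, which allows a fixed value to be supplied instead of a random draw, are assumptions.

```python
import random


def escape_energy(iteration, max_iterations, e0=None):
    """Escape energy E = 2 * E0 * (1 - t / T) from Eq. (16).

    If e0 is not supplied, it is drawn uniformly between -1 and 1.
    """
    if e0 is None:
        e0 = random.uniform(-1.0, 1.0)
    return 2.0 * e0 * (1.0 - iteration / max_iterations)
```

Since the magnitude of $E$ is bounded by $2(1 - t/T)$, it shrinks towards zero as the iterations progress, so later iterations fall increasingly into the besiege conditions that depend on a small $|E|$.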

The exploitation phase is divided into soft and hard besieges. A soft besiege occurs when the conditions $\left|E\right|\ge 0.5$ and $r\ge 0.5$ are met. The prey tries to escape through random jumps but eventually fails, and the position is updated by Eq. (17).

$$X(t+1)=\Delta X(t)-E\left|JX_{rabb}(t)-X(t)\right|,\quad\text{where}\quad\Delta X(t)=X_{rabb}(t)-X(t)$$

(17)

A hard besiege occurs when $\left|E\right|<0.5$ and $r\ge 0.5$. The prey attempts to escape, and the position is updated according to Eq. (18).

$$X(t+1)=X_{rabb}(t)-E\left|\Delta X(t)\right|$$

(18)
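The position updates in Eq. (14), Eq. (17), and Eq. (18) can be written as small element-wise functions. This is a sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the random numbers $q$ and $r_1$ to $r_4$ are passed in explicitly so the behaviour is reproducible, the jump strength $J$ is treated as a plain input because the text does not define it, and `besiege` assumes the condition $r\ge 0.5$ already holds.

```python
def explore(x, x_rand, x_rabbit, x_mean, lower, upper, q, r1, r2, r3, r4):
    """Exploration move of Eq. (14), applied element by element."""
    if q >= 0.5:
        # Perch relative to a randomly chosen hawk.
        return [xr - r1 * abs(xr - 2.0 * r2 * xi) for xr, xi in zip(x_rand, x)]
    # Perch relative to the prey and the average hawk position.
    return [(xb - xm) - r3 * (lower + r4 * (upper - lower))
            for xb, xm in zip(x_rabbit, x_mean)]


def soft_besiege(x, x_rabbit, energy, jump):
    """Soft besiege of Eq. (17); jump stands for J."""
    return [(xb - xi) - energy * abs(jump * xb - xi)
            for xb, xi in zip(x_rabbit, x)]


def hard_besiege(x, x_rabbit, energy):
    """Hard besiege of Eq. (18)."""
    return [xb - energy * abs(xb - xi) for xb, xi in zip(x_rabbit, x)]


def besiege(x, x_rabbit, energy, jump):
    """Choose soft or hard besiege from |E|, assuming r >= 0.5."""
    if abs(energy) >= 0.5:
        return soft_besiege(x, x_rabbit, energy, jump)
    return hard_besiege(x, x_rabbit, energy)
```

In hyperparameter tuning, each position vector would hold one candidate setting, for example a learning rate and a hidden size scaled into the bounds $LB$ and $UB$, and the prey position would be the best candidate found so far.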

The HHO model derives a fitness function (FF) to achieve improved classification performance. The FF assigns a positive value that represents the quality of a candidate solution, and HHO minimizes the classifier error rate as its FF. Its mathematical formulation is given in Eq. (19).

$$\begin{aligned}fitness\left(x_i\right)&=ClassifierErrorRate\left(x_i\right)\\&=\frac{\text{number of misclassified samples}}{\text{total number of samples}}\times 100\end{aligned}$$

(19)
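Eq. (19) amounts to a percentage of wrong predictions, which a few lines can make explicit. The function name `classifier_error_rate` and the list-based label format are illustrative assumptions; in the tuning loop, the predictions would come from a model trained with the candidate hyperparameters $x_i$.

```python
def classifier_error_rate(true_labels, predicted_labels):
    """Percentage of misclassified samples, as in Eq. (19)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    misclassified = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return misclassified / len(true_labels) * 100.0
```

A lower value indicates a better candidate, so the optimizer keeps the hyperparameter setting with the smallest error rate as the current prey position.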
