London’s Natural History Museum has been called “Nature’s Treasurehouse”. Crammed with more than 80mn objects, the 272-year-old collection contains everything from a skeleton of Sophie the Stegosaurus to 12.5mn pinned specimens of butterflies and moths. But this rich trove is busy turning itself into a global digital resource that could open up new pathways for scientific research in our artificial intelligence age.
As he whizzes through the Jurassic gardens, the NHM’s director Doug Gurr enthuses about the possibilities of using the museum’s collections to deepen our understanding of the natural world and the effects of climate change. For example, researchers are currently studying specimens of Antarctic krill, which probably constitute the world’s largest wild biomass, that were collected during the Scott and Shackleton expeditions more than a century ago. “We are now going back to the same locations and seeing how things are changing today,” he says.
As a former head of Amazon UK, Gurr is immersed in technology and is building a specialist AI team within the museum to act as a public good. The team is using pattern recognition software to help the UK Border Force identify endangered animal skins, and to translate museum texts into speech to assist visually impaired visitors.
But the NHM is also painstakingly digitising its collection one specimen at a time, creating a machine-readable scientific database.
Compendious as the NHM’s collections are, they contain only a minute fraction of nature’s data. The vast majority of the planet’s lifeforms are single-cell organisms that have yet to be recorded. If all the species equated to the scale of the Atlantic Ocean then we have sequenced the genomes of just five cups of water, says Glen Gowers, co-founder of Basecamp Research, a London-based start-up. Progress in biology remains a prodigious data challenge.
Basecamp — motto: beyond known biology — is in the business of sequencing as many species as it can. By better understanding the language of evolution, pharmaceutical, agricultural and chemical companies should then be able to create better products for consumers.
The company’s database currently contains more than 10bn genes across 400 terabases of genetic data, but it is steadily expanding with every exploratory expedition it mounts. It is already working in more than 20 countries, ranging from Costa Rica to Malawi and Malta.
How scientifically useful, or commercially valuable, such data sets may be is as yet hard to tell. Venture capital investors have lost a lot of money betting on start-ups aggregating different forms of data. For example, the genetic testing company 23andMe, once valued at $5.8bn, was sold for just $256mn earlier this year.
Gowers accepts that data is not so valuable by itself. It is what you do with the information that counts. To that end, Basecamp has partnered with Nvidia and Microsoft to create its own AI foundation models to analyse and interrogate the internet of biology. “We’re creating this digital twin of evolution and the natural world and allowing models to look into it,” he tells me.
As ever with data, the question arises: who should benefit from its derivatives? Some countries in the global south have been scarred by the “biopiracy” of the global north. In the late 20th century, controversies arose around western companies developing valuable pesticides, dietary products and blood pressure drugs based on native plants from India and southern Africa and venom taken from Brazilian vipers. However, Basecamp runs its own benefit-sharing programme supporting local partners and paying out 1 per cent of the revenues it generates from its data, exceeding international protocols.
Another hot debate is whether generative AI models are hitting a wall, having ingested the entire internet.
Simply throwing more computing power at the same data is unlikely to produce much progress. But Gowers believes that in creating fresh AI-friendly data sets, Basecamp can enable further advances to be made. “That’s why we’re scaling the data as an answer to get past that performance plateau,” he says.
Machines may be able to spot patterns and make connections that are undetectable to humans by combining a string of pixels of a plant with its DNA sequence, for instance. The promise is that by digitising public collections, such as the NHM’s, and creating new data sets, we will be able to understand the world in new ways. The AI revolution still has a long way to run.