Data Analysis Tool Empowers the People

Despairing of data sets? Here’s an automatic, cloud-based system designed for data rookies


Data production doubles each year, but data scientists, who wrangle insights from reams of data, are in short supply. To bridge this gap, a team at A*STAR has developed a fully automatic, web-based system that puts the power of big data analysis in the hands of laypeople1.

Uncovering patterns and relationships hidden in vast data sets requires a machine learning pipeline or 'workflow': a string of algorithms and processes called operators. But not every workflow is appropriate for every situation. So how does the non-expert know which to use? To help, Theint Theint Aye from the A*STAR Institute of High Performance Computing and her colleagues have produced an analytics system for the novice, called the Layman Analytics System (LAS).

Say you have a data set to analyse. The first part of the LAS — the workflow recommender — compares your data set’s metadata to that of existing data sets in a repository. It then selects the best-performing workflows based on those similar repository data sets and passes them to the second part: the workflow optimizer.
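To make that first step concrete, here is a minimal sketch of metadata-based recommendation. The metadata fields, distance measure, and repository entries are illustrative assumptions, not the published LAS design.

```python
# Hypothetical sketch of a metadata-based workflow recommender.
import numpy as np

# Simple metadata profile per data set: (num_rows, num_features, num_classes),
# paired with the workflow that performed best on it. All entries are made up.
repository = {
    "iris":   {"meta": np.array([150, 4, 3]),    "best_workflow": ["scale", "knn"]},
    "adult":  {"meta": np.array([48842, 14, 2]), "best_workflow": ["impute", "onehot", "tree"]},
    "digits": {"meta": np.array([1797, 64, 10]), "best_workflow": ["scale", "pca", "svm"]},
}

def recommend(user_meta, repo, k=2):
    """Return the best-known workflows of the k repository data sets whose
    metadata is closest to the user's (Euclidean distance on log-scaled
    profiles, so large row counts don't dominate the comparison)."""
    scored = []
    for name, entry in repo.items():
        dist = np.linalg.norm(np.log1p(user_meta) - np.log1p(entry["meta"]))
        scored.append((dist, entry["best_workflow"]))
    scored.sort(key=lambda pair: pair[0])
    return [workflow for _, workflow in scored[:k]]

# A new data set: 1,000 rows, 10 features, 2 classes.
print(recommend(np.array([1000, 10, 2]), repository))
```

The underlying idea is simply that data sets with similar shapes tend to respond well to similar pipelines, so the recommender reuses what has already worked.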

Here, ‘genetic programming’ refines the workflow. Operators are randomly replaced, analogous to random genetic mutations in DNA. Mutated workflows are then crossed with each other, which involves swapping pairs of operators between them.

The process then repeats — ‘fittest’ workflows are selected, mutated and crossed — for a predefined number of generations (based on empirical experience). The result: an automatically generated tailor-made workflow.
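A toy version of this evolutionary loop might look like the following. The operator pool, mutation and crossover scheme, and placeholder fitness function are all assumptions for illustration; the real system would score each candidate pipeline on the user's data.

```python
# Minimal sketch of a genetic-programming workflow optimizer.
import random

OPERATOR_POOL = ["impute", "scale", "onehot", "pca", "select_k", "knn", "svm", "tree"]

def fitness(workflow):
    """Placeholder for cross-validated accuracy of the assembled pipeline.
    A real system would build the pipeline and score it on the data set."""
    return random.random()  # stand-in score in [0, 1]

def mutate(workflow, rate=0.2):
    """Randomly replace operators, analogous to point mutations in DNA."""
    return [random.choice(OPERATOR_POOL) if random.random() < rate else op
            for op in workflow]

def crossover(parent_a, parent_b):
    """Swap a random pair of operators between two workflows."""
    a, b = parent_a[:], parent_b[:]
    i = random.randrange(min(len(a), len(b)))
    a[i], b[i] = b[i], a[i]
    return a, b

def optimize(seeds, generations=15, population_size=20):
    """Evolve the recommender's seed workflows for a fixed number of generations."""
    population = seeds[:]
    while len(population) < population_size:
        population.append(mutate(random.choice(seeds), rate=0.5))
    for _ in range(generations):
        # Select the fittest half, then refill by mutation and crossover.
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]
        children = []
        while len(survivors) + len(children) < population_size:
            a, b = crossover(mutate(random.choice(survivors)),
                             mutate(random.choice(survivors)))
            children.extend([a, b])
        population = survivors + children[: population_size - len(survivors)]
    return max(population, key=fitness)

# Start from workflows suggested by the recommender.
best = optimize([["scale", "knn"], ["impute", "onehot", "tree"]])
print(best)
```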

The system is web-based and runs on cloud infrastructure, so there is no need to install special software or supply dedicated computing power.

To evaluate whether the LAS generated appropriate workflows, Aye's team tested it on 114 data sets from the University of California, Irvine, Machine Learning Repository and benchmarked the results against OpenML, an open-source, online machine learning platform.

For 87 data sets (about 76 per cent of the total), LAS-produced workflow accuracy was above the 50th percentile of OpenML’s performance. This figure could improve over time too, Aye says. Users can plug their data sets and workflows back into the repository, providing a richer stock from which the workflow recommender can later draw.
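In other words, a LAS workflow counted as a success when its accuracy beat the median score reported on OpenML for the same data set. A hypothetical check of that criterion, with made-up accuracy values, might look like this:

```python
import numpy as np

# Made-up OpenML accuracies for one data set; real values would come
# from runs that OpenML users have uploaded.
openml_accuracies = [0.71, 0.78, 0.80, 0.84, 0.90]
las_accuracy = 0.85  # hypothetical LAS result on the same data set

# "Above the 50th percentile" means beating the median of those scores.
median = np.percentile(openml_accuracies, 50)
print(las_accuracy > median)  # True: this data set would count among the 87
```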

Non-experts usually take days to generate a good workflow; with the LAS, the average time to produce one over 15 generations was just over three hours. In the future, implementing a faster search technique, or heuristic, could further cut processing time. “Obviously, we would want to run it as efficiently as possible and also have good accuracy values,” Aye says, adding that a graphics processing unit might also boost the LAS’s speed.

The A*STAR-affiliated researchers contributing to this research are from the Institute of High Performance Computing.

Source: A*STAR Research