With the advancement of data collection techniques, there has been an exponential increase in the availability and complexity of datasets, particularly spatiotemporal data; finding the computing power to analyze such Big Data, however, has remained a challenge for researchers in many fields. Through a collaborative research project funded by the National Science Foundation, George Mason University statistics professor Lily Wang hopes to change that.

Wang and the chair of the Department of Statistics at The George Washington University are developing a form of scalable, distributed computing that could lessen the power demand on any single computer by distributing the analysis across a network of computers.
“In the past, we knew there were insights hidden in the data, but due to computing limitations, we couldn’t access them,” said Wang. “Now, with scalable quantile learning techniques, we can gain a deeper understanding of the entire data distribution and extract insights into variability, outliers, and tail behavior, which are critical for more informed decision-making.”
Spatial and temporal data are increasingly used in research areas such as climate studies and health care, Wang noted.
“This data richness presents a lot of opportunities for getting deep insights into dynamic patterns over time and space; but it also brings many, many challenges,” said Wang. Large datasets often exhibit heterogeneous and dynamic patterns, requiring new approaches to capture meaningful relationships.
This project uses two large datasets: the National Environmental Public Health Tracking Network database from the Centers for Disease Control and Prevention and the outdoor air quality data repository from the Environmental Protection Agency.
“Both datasets have been challenging to analyze in the past due to their size and complexity,” explained Wang. “But through scalable and distributed learning techniques, we’re now able to handle large-scale heterogeneous data across the entire United States.”
One of the project’s major innovations is the use of distributed computing to divide the data into smaller, manageable regions. Each region is analyzed separately, and the results are efficiently aggregated to form a comprehensive understanding of the entire dataset.
“You can think of it like dividing the U.S. into small regions, analyzing each one separately, and then combining the results to create a comprehensive national analysis,” Wang said. “This method allows us to analyze millions of data points simultaneously without the need for supercomputers.”
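The divide-and-combine idea Wang describes can be sketched in a few lines. The example below is an illustrative sketch only, not the project’s actual method: it simulates readings for several hypothetical regions, has each region compute its own quantile estimates independently, and then averages the local estimates into a combined national estimate (a common divide-and-conquer aggregation strategy). All region names, sizes, and distributions here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate pollutant readings for 8 hypothetical regions
# (in the real project, each region's data would live on a different machine)
regions = [rng.normal(loc=35, scale=10, size=100_000) for _ in range(8)]

def local_quantiles(data, taus):
    # Each region estimates its own quantiles without seeing other regions' data
    return np.quantile(data, taus)

taus = [0.5, 0.9, 0.99]  # median, upper tail, extreme tail

# Step 1: analyze each region separately
local = np.array([local_quantiles(r, taus) for r in regions])

# Step 2: aggregate the local estimates into one national estimate
combined = local.mean(axis=0)
print(dict(zip(taus, np.round(combined, 2))))
```

Because each regional computation is independent, the per-region step can run in parallel across many ordinary machines, which is what removes the need for a single supercomputer.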
Beyond its goals for technical advancement, the project also emphasizes training the next generation of data scientists. Graduate students at George Mason and George Washington universities will gain hands-on experience working with real-world data, helping to develop new computational methods.
The project began on September 1, 2024, and is expected to last three years. It has already garnered attention, including recognition from the office of Congressman Gerry Connolly (D-VA).
The potential applications of this research are far-reaching, from improving air quality predictions to understanding public health trends and beyond. Wang explained, “This work empowers researchers and policymakers to leverage vast amounts of data to address rising societal issues more effectively.”