In my senior year of college, I studied air quality modeling with Dr. Sofia Piltz. We took two approaches: a physics-based one and a statistical one. Through this project I learned a lot about modeling with differential equations, and about how hard it is to model complex natural processes. Even predicting the air quality three days from now is quite difficult.
Using a conglomeration of many NOAA datasets, I've taken this project a bit further on my own. Acquiring and cleaning the NOAA data was unlike anything I've experienced before. There seem to be no shared process or standards for the data. Many of the datasets have little to no description inside the data files themselves; the only description is in natural language on a page that links to the data but is not linked from the data. Datasets that seem like they should be stored together are often stored in different places, alongside datasets from completely separate projects. All of this makes it very difficult, if not impossible, to scrape the data programmatically. Nevertheless, I succeeded in acquiring many of these datasets and compiling them into one large dataset that I could use to test and validate models.
I spent quite a bit of time refining the differential equation models, and had some success there. The tough part of this project is that there is a large and very important term that is difficult to find, or even collect, data on. This term is known in the literature as "sources and sinks" and describes all of the factors that produce particulate emissions, as well as those that sequester them. On the source side, this means tracking highway and city traffic, factory emissions, and agricultural emissions, to name a few of the larger factors. On the sink side, it means understanding the carbon sequestration in the surrounding area and the chemical scrubbing occurring in the atmosphere. Together these make up an incredibly complex term, and I could not find data that covered it. The conclusion I came to was that without a large amount of time, or a better dataset, I was going to have trouble getting further in this project.
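
To make the role of the "sources and sinks" term concrete, here is a minimal sketch. It is not the model from this project. It assumes a single well-mixed box of air in which the concentration C follows dC/dt = source - loss_rate * C, with every source lumped into one constant and every sink lumped into one first-order loss rate. The function name, parameter names, and numbers are all hypothetical and chosen only for illustration.

```python
def simulate_box_model(c0, source, loss_rate, dt, steps):
    """Forward-Euler integration of dC/dt = source - loss_rate * C.

    c0        -- initial concentration (arbitrary units)
    source    -- lumped emission rate (assumed constant here)
    loss_rate -- lumped first-order sink rate (assumed constant here)
    dt        -- time step
    steps     -- number of Euler steps to take
    """
    concentrations = [c0]
    c = c0
    for _ in range(steps):
        # Sources push the concentration up; sinks remove a fraction of it.
        c = c + dt * (source - loss_rate * c)
        concentrations.append(c)
    return concentrations


# Illustrative numbers only; they do not come from any NOAA dataset.
SOURCE = 2.0
LOSS_RATE = 0.1
series = simulate_box_model(c0=5.0, source=SOURCE, loss_rate=LOSS_RATE, dt=0.1, steps=2000)

# With constant terms the concentration settles at source / loss_rate.
steady_state = SOURCE / LOSS_RATE
```

Even in this toy version, the long-run concentration is set entirely by the ratio of sources to sinks, which is why getting that term wrong (or having no data for it) undermines everything else in the model.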