Machine Learning
From Living Building Science
Revision as of 20:13, 10 December 2021
Welcome to the Machine Learning Team Page! Our main goal is to use machine learning and analysis tools to assist researchers from different disciplines in their studies of bees (Bee Snap team) and sustainable construction (Living Building Science team).
Some highlights of our work:
- Developed models to classify flowers and bees, which are to be used in the game app BeeKeeper GO (another project within the Bee Snap space).
- Applied data analysis and computer vision to detect swarms in bee hives and to identify the contributing factors and warning signs that precede a swarm.
- Developed bird image & sound classification models, which are to be integrated with an ongoing project owned by the Biodiversity team.
- Developing an image classification model to detect varroa mites on honeybees (in progress).
- Performing data analysis on drone data (in progress).
Contents
- 1 Flower and Bee Classification
- 2 Hive Analysis
- 3 Swarm Time Series Analysis
- 4 Bird Classification
- 5 Varroa Mite Detection
- 6 Project Updates
- 7 Team Members
Flower and Bee Classification
[Fall 2020 - Spring 2021] ...
Hive Analysis
[Fall 2020 - Spring 2021]
Initially, we created a script and tracking algorithm to draw the paths of bee flight in a given video. To use this data to forecast bee swarming, we modified the algorithm to track the distance traveled by each bee in each frame and append it to a CSV file. This CSV file is intended to serve as a feature for analyzing bee movements on campus and determining their relative speeds and active points. We paused further work on this CSV tracking when we cut down on the number of projects this semester.
GitHub Repo: Hive Analysis
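The per-frame distance logging described above can be sketched as follows. This is a minimal illustration rather than the team's actual script: the track IDs, centroid positions, and file name are all hypothetical stand-ins for whatever the tracker produces.

```python
import csv
import math

def append_frame_distances(positions_prev, positions_curr, frame_idx, csv_path):
    """Append per-bee distance traveled between consecutive frames to a CSV file.

    positions_prev / positions_curr: dicts mapping a (hypothetical) bee track ID
    to an (x, y) pixel centroid produced by the tracking algorithm.
    """
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for bee_id, (x1, y1) in positions_curr.items():
            # Only bees visible in both frames have a defined displacement.
            if bee_id in positions_prev:
                x0, y0 = positions_prev[bee_id]
                dist = math.hypot(x1 - x0, y1 - y0)
                writer.writerow([frame_idx, bee_id, round(dist, 2)])

# Example: one bee moves 3 px right and 4 px down between frames -> distance 5.0
append_frame_distances({1: (10, 10)}, {1: (13, 14)}, frame_idx=2, csv_path="bee_distances.csv")
```

Accumulating these rows over a whole video yields the per-bee speed profile used to find active points.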
Swarm Time Series Analysis
We developed three primary models, ARIMA, ARIMAX, and SARIMAX, in order to make predictions on future hive weights based on current hive weight data.
ARIMA is a type of time series model that stands for Auto-Regressive Integrated Moving Average. ARIMA is commonly used for time series analysis because it is good at using information in the past lagged values of a time series (which is simply data mapped over time) to predict future values. As the name implies, the model consists of three primary components. The first component, the Auto-Regressive (AR) component, involves regressing the time series data onto a previous version of itself using lags, or prior values. The AR component is measured by a parameter p, which is the number of lag observations in the model. The second component, the Integrated (I) component, involves differencing the raw data in order to make the time series stationary. A time series is stationary if there are no trend or seasonal effects, and the overall statistical properties (such as mean and variance) are constant over time. The I component is measured by a parameter d, which is the number of times raw observations need to be differenced to become stationary. The final component, the Moving Average (MA) component, involves using the previous lagged errors to model the current error of the time series. The MA component is measured by a parameter q, which is the size of the moving average window.
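The "I" component above is easy to see on a toy series: one round of differencing removes a linear trend, leaving a series with constant mean and variance. A minimal sketch (synthetic data, not hive weights):

```python
import numpy as np

# A toy series with a linear trend: not stationary (its mean grows over time).
t = np.arange(20)
series = 2.0 * t + 5.0

# The "I" step of ARIMA: first-order differencing (d = 1).
diffed = np.diff(series)

# After one difference the trend is gone -- every value equals the slope,
# so the mean and variance are constant over time (stationary).
print(diffed[:5])  # -> [2. 2. 2. 2. 2.]
```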
Each ARIMA model is uniquely determined by its p, d, and q values. To determine which ARIMA (p, d, q) is best for forecasting hive weights using our data, we conducted statistical tests such as the Augmented Dickey-Fuller (ADF) test and examined plots such as the auto-correlation (ACF) and partial auto-correlation (PACF) plots.
In addition to ARIMA, we also used two additional models, ARIMAX and SARIMAX, to forecast hive weights. Both ARIMAX and SARIMAX involve using exogenous (X) variables, which are essentially other variables in the time series that may be used to assist in forecasting the original variable. For our exogenous variables, we used hive temperature, hive humidity, ambient temperature, ambient humidity, and ambient rain. SARIMAX differs from ARIMAX in the sense that it takes seasonality (S) into account, as it is often used on datasets that have seasonal cycles.
After forecasting hive weights with all three models, we evaluated each set of predictions using Mean Absolute Percentage Error (MAPE). We chose MAPE as our error metric because it is intuitive: it is simply the averaged percentage difference between actual and forecast values, and, being a percentage, it is independent of the scale of the data. The mean absolute percentage error was 4.041 for ARIMA, 4.049 for ARIMAX, and 4.039 for SARIMAX. The improvement from ARIMA to SARIMAX was therefore only about 0.06%, which is negligible. Because of this, we determined that the best approach going forward would be to use the ARIMA model: it is still fairly accurate and requires only weight data, making it more feasible to apply to Kendeda hive data, which is our ultimate goal for this sub-project.
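The MAPE metric described above reduces to a few lines; this is a generic sketch, not the team's evaluation code:

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Averages |actual - forecast| / |actual| over all observations, so the
    result is scale-independent (assumes no actual value is zero).
    """
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Example: forecasts off by exactly 5% on each observation -> MAPE of 5.0
actual = [40.0, 50.0, 60.0]
forecast = [42.0, 52.5, 63.0]
print(round(mape(actual, forecast), 3))  # -> 5.0
```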
To put this model into production, we serialized the model and connected it to an AWS EC2 instance, allowing other users to access a basic web application where they can input a hive weight dataset and receive predictions for the next 5 days' worth of hive weight values. We serialized the model with the Python pickle package. The serialized model is then loaded in a Flask app, where it is fit to the input data to make predictions. This Flask app was finally connected to the AWS EC2 instance, making the prediction process accessible to all users.
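The pickle round-trip at the core of that deployment can be sketched as below. The `DummyModel` class and its `forecast` interface are placeholders for the fitted ARIMA model, and the Flask routing is omitted; this only shows the serialize/deserialize/predict cycle.

```python
import pickle

# Stand-in for the fitted ARIMA model; only the interface is illustrative.
class DummyModel:
    def forecast(self, steps):
        return [41.0] * steps  # placeholder predictions

model = DummyModel()

# Serialize the fitted model to bytes, as done with pickle before deployment.
blob = pickle.dumps(model)

# Inside the Flask route handler, the model would be loaded and reused:
loaded = pickle.loads(blob)
predictions = loaded.forecast(steps=5)  # 5 days' worth of hive weights
print(predictions)  # -> [41.0, 41.0, 41.0, 41.0, 41.0]
```

In the real app the bytes would live in a file shipped to the EC2 instance, loaded once at Flask startup rather than per request.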
Bird Classification
To assist the Biodiversity team in learning more about the bird species around the Kendeda Building, we developed two models: one to classify birds based on images and one to classify birds based on sound.
The first step was to collect the data. For the bird images, this was simple: we found a dataset on Kaggle that met our needs. Bird sound data was tougher. Dr. Weigel suggested a few websites for us to look into, but those did not work out since sounds could only be downloaded one at a time. A machine learning model would need far more data, so we turned to Kaggle and found a dataset, but it was too large for our purposes, occupying around 25 GB of space (which would lead to very slow upload/download times). We therefore used DGX (the supercomputer on campus) to download these files and extract only 10 bird species' worth of sound data so that we could train the model.
Next we had to construct the models. With the image data, we could preprocess the data and feed it into a convolutional neural network (CNN) to train, then validate with testing data. We split the data into training and test sets with a test ratio of 0.25 and trained VGG and MobileNet models, using the ReduceLROnPlateau callback to lower the learning rate when accuracy plateaued. Training takes around an hour to run, and we tuned the models based on accuracy. The sound data required more work, since raw audio is not well suited to a CNN. We therefore mapped all of our MP3 files to spectrograms using the Python librosa module. After some research, we found that spectrograms cannot be preprocessed like traditional image data, so we processed the sound frequencies instead, removing the low-frequency sounds since bird songs are high frequency. We then stored both the unprocessed and the processed spectrograms as PNG image files and fed them into CNNs.
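The team used librosa for the MP3-to-spectrogram step; the underlying idea, including the low-frequency removal, can be sketched with plain NumPy FFTs. The frame size, hop length, and 2 kHz cutoff below are illustrative assumptions, not the values the team used.

```python
import numpy as np

def spectrogram(signal, sr, n_fft=256, hop=128, cutoff_hz=2000):
    """Magnitude spectrogram with low-frequency bins zeroed out.

    Mirrors the preprocessing described above: bird song is high-frequency,
    so rows below cutoff_hz are discarded before the image is saved.
    """
    # Slide a Hann-windowed frame across the signal (a basic STFT).
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (freq_bins, time_frames)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    spec[freqs < cutoff_hz, :] = 0.0               # remove low-frequency rows
    return spec, freqs

# A 4 kHz test tone sampled at 22050 Hz survives the 2 kHz cutoff.
sr = 22050
t = np.arange(sr) / sr
spec, freqs = spectrogram(np.sin(2 * np.pi * 4000 * t), sr)
print(spec[freqs < 2000].sum() == 0.0)  # -> True
```

In the actual pipeline the resulting matrix is rendered to a PNG (e.g. as a color-mapped image) and fed to the CNN like any other picture.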
Above are example images of the spectrograms created after we converted a given audio file. These images were used in the CNN model as training and test data.
The bird image classification model had around 96% training accuracy and 89% test accuracy.
The bird sound classification model had around 25% accuracy with the processed spectrograms and a 71.8% accuracy with the unprocessed spectrograms.
When analyzing the images, we could tell that the processed spectrograms did not have many distinguishing features, whereas the unprocessed spectrograms show a clear contrast between the black background and the purple/yellow bars. This may be why the model achieved significantly higher accuracy on the unprocessed spectrograms: it can distinguish the sound waves much more clearly.
For the bird sound model, we trained with only 10 species, so we would need to collect more data, increase the number of species, and retrain the model. Once that is completed, the next step would be to feed the raw data collected on campus into our model and evaluate its predictions.
Varroa Mite Detection
[Spring 2021 - Present]
We learned that varroa mites can be very destructive to bee hives, making them one of the biggest threats to the bee population as a whole. At the early stages of a varroa infestation, bee colonies generally show very few symptoms. If varroa mites can be detected early on, beekeepers can eliminate them and save their hives. With the aim of protecting and promoting bee health, we developed an image classification model using convolutional layers to detect varroa mites on honey bees.
To train the deep learning model, we used the Honey Bee Annotated Dataset from Kaggle, which consists of approximately 4100 images (76% healthy bees and 24% varroa-infected bees). The model was tuned and trained to limit overfitting, achieving a training accuracy of approximately 96.7% and an evaluation accuracy of approximately 97.3%. However, the model does not perform well when tested on higher-resolution images (images found via Google search).
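The write-up does not say how the 76/24 class imbalance was handled during training; one common approach is per-class weighting, which most deep learning frameworks accept directly. A sketch using the usual n_samples / (n_classes * n_in_class) rule, on toy labels in the same proportion as the dataset:

```python
def class_weights(labels):
    """Inverse-frequency class weights: rarer classes get larger weights."""
    classes = sorted(set(labels))
    n = len(labels)
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

# Toy labels mirroring the dataset's ~76% healthy / ~24% infected split.
labels = ["healthy"] * 76 + ["varroa"] * 24
weights = class_weights(labels)
print(weights)  # the minority "varroa" class gets the larger weight
```

Passing such weights to the loss function makes a misclassified minority-class image cost more, discouraging the model from simply predicting "healthy" everywhere.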
We suspect that the model's underperformance, despite its high accuracy during training and evaluation, is due to the difference in resolution between the dataset images and the test images (even after preprocessing the test images). We expect to have a camera set up by the hive at the Kendeda Building next semester, and the images taken by this camera will be used to test and tune the current varroa detection model for better performance.
Project Updates
Week 1: Introduction to Living Building Science and new members
Week 2: Assignment to sub-teams
Week 4: Slides for 2/9/2021
Week 6: Slides for 2/23/2021
Week 8: Slides for 3/9/2021
Week 9: Wellness Day
Week 11: Slides for 3/30/2021
Week 13: Slides for 4/13/2021
Week 15: Final Presentation
Team Members

|Name||Major||Semesters|
|Sukhesh Nuthalapati||Computer Science||Spring 2020 - Spring 2021|
|Rishab Solanki||Computer Science||Spring 2020 - Spring 2021|
|Sneh Shah||Computer Science||Spring 2020 - Spring 2021|
|Jonathan Wu||Computer Science||Spring 2020 - Spring 2021|
|Daniel Tan||Computer Science||Fall 2020 - Fall 2021|
|Quynh Trinh||Computer Science||Fall 2020 - Fall 2021|
|Chloe Devre||Computer Science||Fall 2021 - Present|
|Crystal Phiri||Computer Science||Fall 2021 - Present|
|Sarang Desai||Computer Science||Fall 2021 - Present|