Introduction

This page contains the project materials for UCSD's Data Science Capstone sequence.

Projects are grouped into subject-matter areas called domains of inquiry, led by the domain mentors listed below.

Each project listing contains:
  • The title and abstract
  • A link to the project's website
  • A link to the project's code repository

Areas of Study


Projects

Domain / Project
Explainable AI
Determining Correct Face Mask Usage with Inception Resnet and MaskedFace-Net Dataset
Face Mask Detection with Explainable Artificial Intelligence
Explaining Image Captioning Models Through Attention Maps, Image Perturbations, and Object Importance Maps
Image Recognition in Stock Prediction with Visual Explanations from Grad-CAM
Demystifying Machine Learning for Facial Analysis
Snake Classification using Neural-Backed Decision Trees
Autonomous Vehicles
Autoware and LGSVL
Computer Vision and Lane Segmentation in Autonomous Vehicles
Autonomous Mapping, Localization and Navigation using Computer Vision as well as Tuning of Camera
Data Visualizations and Interface For Autonomous Robots
GPS Based Autonomous Navigation on the 1/5th Scale
Autonomous: Odometry and IMU
Malware and Graph Learning
Malware Detection
Potential Improvement of MAMADROID System
Exploring the Language of Malware
CoCoDroid: Detecting Malware By Building Common Graph Using Control Flow Graph
Attacking the HinDroid Malware Detector
Text Mining and NLP
AutoPhrase Application Web
Analyzing Movies Using Phrase Mining
AutoPhrase for Financial Documents Interpretation
Text Classification with Named-Entity Recognition and AutoPhrase
AutoLibrary - A Personal Digital Library to Find Related Works via Text Analyzer
Restaurant Recommender System
Recommender Systems
ForumRec - A Question Recommender for the Super User Community
OnSight: Outdoor Rock Climbing Recommendations
Bridging the Gap: Solving Music Disputes with Recommendation Systems
Asnapp - Workout Video Recommender
Plates4U
Makeup Recommender
Opioid Overdose Prevalence Analysis
Opioid Overdose Prevalence Analysis
Large-scale Multiple Testing
Multiple Testing Method with Empirical Null Distribution in Leukemia Studies
Large-scale Multiple Testing with Empirical Null Distribution in Predicting Cardiovascular Disease
Spatial-temporal Analyses of Infectious Disease Dynamics
Spatial-temporal Prediction of COVID-19 Case Counts Through Epidemiology Model
Graph Data Analysis
A Graph ML Analysis of Senatorial Twitter Accounts
GCN on 3d Points
Graph-Based Product Recommendation
NBA Game Prediction
NBA Seeds with Graph Neural Networks
Stock Market Sentiment Predictor
The Spread of Misinformation
Political Popularity of Misinformation
The Sentiment of U.S. Presidential Elections on Twitter
Community Effects From Misinformation Flags on Twitter
Political Polarization of Major News Networks on Twitter
Twitter’s Impact on Elections
Analyzing the Diffusions of Various Forms of Misinformation on Reddit
COVID-19 Sentiment and Daily Cases Analysis on Social Media
Conflict and Collaboration in Online Communities
Controversy In Wikipedia Articles
The Large-Scale Collaborative Presence of Online Fandoms
Wikipedia’s Response to the COVID-19 Pandemic
A Study of LGBTQ+ Wikipedia Articles Sentiment over Time
Politics on Wikipedia
Genetic Basis of Mental Health
Differential Gene Expression Analysis of Human Opioid Abusers
The Genetic Basis of Antibiotic Resistance in E. Coli
Blood-based Analysis of Alzheimer's Disease from miRNA Data
Comparison of Differential Gene Expression Analysis Tools
Genetic Overlap between Alzheimer's, Parkinson’s, and healthy patients
VPN X-Ray
Live vs. Video on Demand Inside VPN Detection
DANE: Data Automation and Network Emulation Tool
Res Recovery: Classifying Video Resolutions Through a VPN Tunnel
SPICIVPN
Particle Physics
Interpreting Higgs Boson Interaction Network with Layerwise Relevance Propagation
Deep Learning for Particle Jet Multiclassification
COVID-19 & Microbiome
RTL Automation
Cyber-Physical Systems (CPS) using IOT Devices
AUTOBRICK: A system for end-to-end automation of building point labels to Brick turtle files
Airborne Infection Risk Estimator for COVID-19
System Usage Reporting (SUR, a.k.a. DCA)
Mouse Wait Classification
Predicting Battery Remaining Minutes Based on Related Features
Persona Analysis
Predicting a User’s Persona Using Computer’s Specifications, CPU Utilization, CPU Temperature & Application Usage Time
Spatial Agent-based Modeling for School Reopening
Geographically Assisted Agent-based Model for COVID-19 Transmission (GeoACT)
Modelling COVID-19 Transmission in San Diego School Buses
COVID-19 Spatial Agent-based Modeling: Single Room Infection

Project Details

Explainable AI


Determining Correct Face Mask Usage with Inception Resnet and MaskedFace-Net Dataset

  • Group members: Pratyush Juneja, Eric Kang, Minjoo Kim

Materials:

Abstract: Many models and algorithms in artificial intelligence are considered “black box models”: models that do not provide transparency into how they reached a certain conclusion. Ultimately, although they may be accurate, they are uninterpretable by humans and lack transparency, which undermines trust. We aim to provide this transparency through explainable artificial intelligence. First and foremost, we aim to present a model that can determine whether individuals are properly wearing a mask, improperly wearing a mask, or not wearing a mask at all, in light of the COVID-19 pandemic. Especially as this is a high-stakes situation, with businesses and individuals at risk, transparency is key. Second, we aim to provide this transparency through GradCAM, which will highlight how our model came to its decision. The method uses an untrained Inception Resnet V1 to determine the mask usage in a given image. Gradient descent is used for training, with cross-entropy as the loss function. Finally, GradCAM is applied to the images and outputs a coarse heatmap from the last layers of our convolutional neural network that shows exactly what our model is looking at. Currently, our model reaches 96% accuracy.
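
As a rough illustration of the Grad-CAM step described above, the sketch below computes a coarse class-activation heatmap from the gradients flowing into the last convolutional block of a stand-in torchvision ResNet; the model, layer choice, and input are placeholders, not the project's actual Inception Resnet V1 pipeline.

```python
# Minimal Grad-CAM sketch (stand-in model and layer; not the project's Inception Resnet V1).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None)      # placeholder classifier
model.eval()

feature_maps = {}

def save_features(module, inputs, output):
    output.retain_grad()                   # keep gradients of this non-leaf tensor
    feature_maps["maps"] = output          # activations of the last conv block

model.layer4.register_forward_hook(save_features)

image = torch.randn(1, 3, 224, 224)        # placeholder input image tensor
logits = model(image)
class_idx = int(logits.argmax(dim=1))
logits[0, class_idx].backward()            # backprop the predicted class score

maps = feature_maps["maps"]
weights = maps.grad.mean(dim=(2, 3), keepdim=True)        # per-channel importance
cam = F.relu((weights * maps).sum(dim=1, keepdim=True))   # weighted sum of feature maps
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized coarse heatmap
print(cam.shape)                                          # (1, 1, 224, 224)
```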

Face Mask Detection with Explainable Artificial Intelligence

  • Group members: Gavin Tran, Athena Liu, Che-Wei Lin

Materials:

Abstract: This report addresses the concern of careless mask wearing during the ongoing pandemic, which has caused financial problems for small and large businesses alike. We tackle this issue by building a face mask detector that recognizes whether a person is wearing a mask. More importantly, our detector can not only detect whether a person has a mask on, but also whether the person is wearing the mask correctly, i.e., with both the chin and nose covered. Our detector was trained on a dataset called MaskedFace-Net, which contains more than 35,000 images, and was able to fit the training data while performing even better on the validation and test sets: it achieves 88% accuracy on the training set and 95% accuracy on both the validation and test sets.

Explaining Image Captioning Models Through Attention Maps, Image Perturbations, and Object Importance Maps

  • Group members: Yuexiang Zhang, Alejandro Fosado, Jordan Levy

Materials:

Abstract: Image captioning models are complex because they perform object detection as well as caption generation. When these models fail, it is hard to understand where and why they fail. To explain how an image captioning model works, we use attention maps to visualize the relationships between generated words and objects in an image. Moreover, we utilize an image perturbation model to alter regions of images to see how the captions change, and we test the robustness of our model by measuring the similarity between captions generated before and after altering the image.

Image Recognition in Stock Prediction with Visual Explanations from Grad-CAM

  • Group members: Sohyun Lee, Shin Ehara, Jou-Ying Lee

Materials:

Abstract: Deep learning architectures are now publicly recognized and repeatedly proven to be powerful in a wide range of high-level prediction tasks. While these models generally perform more than satisfactorily with appropriate tuning, a long-troubling issue lies in the unexplainability of how they learn and predict. This interpretability of “how” machines learn is oftentimes even more important than ensuring machines output “correct” predictions. Especially in the field of finance, users’ ability to dissect how and why an algorithm reached a conclusion from a business standpoint is integral for later applications, e.g., incorporation into business decision making. This project studies similar prior work on image recognition in the financial market and takes a step further in explaining the predictions output by the Convolutional Neural Network by applying the Grad-CAM algorithm.

Demystifying Machine Learning for Facial Analysis

  • Group members: Ting Hung, Sudiksha Sarvepalli, Nicole Lee

Materials:

Abstract: Due to the burgeoning of machine learning and artificial intelligence technology, it may feel as though there are eyes perpetually watching us. It is undeniable that, whether it is through surveillance cameras, phones, or desktops, we are always exposed to being analyzed by merely living our everyday lives. The most frightening part about this phenomenon is that most people are unaware of what is actually being seen and how. As our society begins to yield more responsibility and credibility to image analysis and other machine learning software, it is important to educate the public so that we collectively make informed decisions about them. Our interactive web application conducts facial analysis and utilizes explainable artificial intelligence (XAI) to aid in communicating the inner workings of the machine learning "black box." In addition, we discuss the importance of model fairness, the role of XAI in ensuring fairness, and the potential discriminatory practices that stem from the imprudent use of machine learning.

Snake Classification using Neural-Backed Decision Trees

  • Group members: Rui Zheng, Weihua Zhao, Nikolas Racelis-Russell

Materials:

Abstract: Many advanced algorithms, specifically deep learning models, are considered “black boxes” to human understanding. The lack of transparency in interpreting such models has become a key obstacle that prevents these algorithms from being put into practical use. Although algorithms such as GradCam have been invented to provide visual explanations from deep networks via gradient-based localization, they do not show step by step how the models reached their final decision. The goal of this project is to provide more interpretability for Convolutional Neural Network (CNN) models by combining Grad-CAM with Neural-Backed Decision Trees (NBDTs), providing visual explanations together with a detailed decision-making process for CNN models. This project demonstrates the potential and limitations of jointly applying Grad-CAM and NBDTs to snake classification.

Autonomous Vehicles


Autoware and LGSVL

  • Group members: Andres Bernal, Amir Uqdah, Jie Wang

Materials:

Abstract: We replicated the ThunderHill race track using the Unity 3D game engine and integrated the track and robot from Unity into the LGSVL simulator. Once the integration was complete, we were able to see our robot with the Thunderhill track as our map in the simulator. We then virtualized the functions of the IMU, odometry, and lidar sensors and RGB-D cameras to better visualize what our robot perceives in the simulation. Finally, we were able to fully visualize what our robot sees with the virtual sensors using Autoware Rviz, which displays the location and point cloud map of the vehicle and its surroundings.

Computer Vision and Lane Segmentation in Autonomous Vehicles

  • Group members: Evan Kim, Joseph Fallon, Ka Chan

Materials:

Abstract: Perception is absolutely vital to navigation. Without perception, a corporeal entity cannot localize itself in its environment and move around obstacles. To create an autonomous vehicle (AV) capable of racing on the Thunderhill Raceway track in Berkeley, California, the team must incorporate a stereo camera capable of supporting image perception and computer vision. The team was given two cameras, the Intel RealSense D455 and the ZED camera. In this analysis, the team compares the capabilities of the two cameras to responsibly select one capable of supporting object detection, and develops a lane segmentation algorithm that helps extract lanes from the camera feed.
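
For context on what lane extraction from a camera feed can look like, here is a minimal classical baseline (Canny edges plus a probabilistic Hough transform) in OpenCV; the thresholds, region of interest, and file paths are assumptions, and this is not the team's actual segmentation algorithm.

```python
# Classical lane-extraction baseline (hedged sketch; not the team's actual algorithm).
import cv2
import numpy as np

def extract_lane_lines(frame: np.ndarray) -> np.ndarray:
    """Return the frame with candidate lane lines drawn on it."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, 50, 150)          # edge map; thresholds are guesses

    # Keep only the lower half of the image, where lanes usually appear.
    mask = np.zeros_like(edges)
    h, w = edges.shape
    mask[h // 2 :, :] = 255
    roi_edges = cv2.bitwise_and(edges, mask)

    # Probabilistic Hough transform to find line segments in the edge map.
    lines = cv2.HoughLinesP(roi_edges, rho=1, theta=np.pi / 180, threshold=50,
                            minLineLength=40, maxLineGap=20)
    output = frame.copy()
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            cv2.line(output, (x1, y1), (x2, y2), (0, 255, 0), 3)
    return output

# Example usage on a single frame (the file path is hypothetical).
# frame = cv2.imread("sample_frame.png")
# cv2.imwrite("lanes.png", extract_lane_lines(frame))
```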

Autonomous Mapping, Localization and Navigation using Computer Vision as well as Tuning of Camera

  • Group members: Youngseo Do, Jay Chong, Siddharth Saha

Materials:

Abstract: One of the main tools used in autonomous mapping and navigation is a 3D Lidar. A 3D Lidar provides various advantages: it is not sensitive to light conditions, it can detect color through reflective channels, it has a complete 360-degree view of the environment, and it does not require any “learning” to detect obstacles. One can use the reflective channel to detect the color of lanes as well as avoid obstacles. The point cloud information from the Lidar can also easily enable mapping and localization, as the vehicle will know where it is at all points. It is easy to see why so many large-scale autonomous vehicle groups invest in expensive and bulky Lidars. However, this is not accessible to all due to its price. A camera (even a depth camera) is much more affordable. However, it comes with its own slew of disadvantages. It can see color, but programming for color is hard due to varying light conditions. Unless you use multiple cameras, you often can’t see all around you. These factors together are a hindrance to autonomous navigation. We thus aim to demonstrate three goals: 1) mapping and localization with a single camera and other sensory information using the RTABMAP SLAM algorithm; 2) obstacle avoidance and lane following with a single camera using Facebook AI's Detectron2 deep learning framework and ROS; 3) tuning of the camera to be less sensitive to varying light conditions using ROS rqt_reconfigure.

Data Visualizations and Interface For Autonomous Robots

  • Group members: Jia Shi, Seokmin Hong, Yuxi Luo

Materials:

Abstract: Autonomous navigation requires a wide range of engineering expertise and a well-developed technological architecture in order to operate. The focus of this project and report is to illustrate the significance of data visualizations and an interactive interface with regard to autonomous navigation in a racing environment. In order to yield the best results in an autonomous navigation race, the users must be able to understand the behavior of the vehicle when training navigation models and during the live race. To address these concerns, teams working on autonomous navigation must be able to visualize and interact with the robot. In this project, different algorithms such as A* search and RRT* (Rapidly-exploring Random Tree) are implemented for path planning and obstacle avoidance. Visualizations of these respective algorithms and a user interface to send/receive commands will help to enhance model testing, debug unexpected behavior, and improve upon existing autonomous navigation models. Simulations with the most optimal navigation algorithm will also be run to demonstrate the functionality of the interactive interface. Results, implications of the interface, and further improvements will also be discussed.
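
As a small illustration of the path-planning component mentioned above, the following is a minimal A* search over a toy occupancy grid; the grid, costs, and heuristic are placeholders rather than the project's actual planner or map representation.

```python
# Minimal A* grid search (toy sketch of the path-planning idea; not the project's planner).
import heapq
import itertools

def astar(grid, start, goal):
    """grid: 2D list where 0 = free cell and 1 = obstacle; returns a list of cells or None."""
    rows, cols = len(grid), len(grid[0])
    heuristic = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])   # Manhattan distance
    tie = itertools.count()                     # tie-breaker so the heap never compares cells
    open_heap = [(heuristic(start, goal), next(tie), 0, start, None)]
    came_from, best_g = {}, {start: 0}
    while open_heap:
        _, _, g, current, parent = heapq.heappop(open_heap)
        if current in came_from:
            continue
        came_from[current] = parent
        if current == goal:                     # reconstruct the path by walking parents
            path = [current]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        r, c = current
        for nbr in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nbr
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < best_g.get(nbr, float("inf")):
                    best_g[nbr] = new_g
                    heapq.heappush(open_heap, (new_g + heuristic(nbr, goal),
                                               next(tie), new_g, nbr, current))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(astar(grid, start=(0, 0), goal=(2, 0)))
```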

GPS Based Autonomous Navigation on the 1/5th Scale

  • Group members: Shiyin Liang, Garrett Gibo, Neghena Faizyar

Materials:

Abstract: Self-driving vehicles are revolutionizing the automotive industry, with companies like Tesla, Toyota, Audi, and many more pouring substantial amounts of money into research and development. While many of these self-driving systems use a combination of cameras, lidars, and radars for local perception and navigation, the fundamental global localization system that they use relies upon GPS. The challenge in building a navigation system around GPS derives from the inherent issues of the sensor itself. In general, GPS units tend to suffer from signal interference, which leads to infrequent positional updates and lower precision. At the 1/5th car scale, positional inaccuracies are magnified, so it is crucial that we know the location of our vehicle with speed and precision. In this project, we compare the performance of different GPS units in order to determine what level of performance is best suited to the 1/5th scale. Using the best-suited GPS, we design a navigation system that mitigates the shortcomings of the GPS and provides a reliable autonomous vehicle.

Autonomous: Odometry and IMU

  • Group members: Pranav Deshmane, Sally Poon

Materials:

Abstract: For a vehicle to successfully navigate itself and even race autonomously, it is essential for the vehicle to be able to localize itself within its environment. This is where odometry and IMU data can greatly support the robot’s navigational ability. Wheel odometry provides useful measurements to estimate the position of the car using the wheel’s circumference and rotations per second. The IMU, which stands for Inertial Measurement Unit, is a 9-axis sensor that can sense linear acceleration, angular velocity, and magnetic fields. Together, these data sources can provide crucial information for deriving a position estimate (how far our robot has traveled) and a compass heading (the orientation of the robot/where it’s headed). While most navigation stacks rely on GPS or computer vision to achieve successful navigation, this leaves the robot vulnerable to unfavorable scenarios. For example, GPS is prone to lag and may be infeasible in unfamiliar terrain. Computer vision approaches often depend heavily on training data and cannot always provide continuous and accurate orientation. Odometry and IMU readings are thus invaluable sources of sensing information that can easily complement and enhance existing navigation stacks to build more robust and accurate autonomous navigation models.
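
A minimal dead-reckoning sketch of the idea described above, combining a wheel-odometry speed estimate with an IMU-style heading, is shown below; the wheel circumference, sample rate, and sensor values are made-up placeholders.

```python
# Dead-reckoning sketch: wheel odometry for distance, heading for direction.
# All numbers below are made-up placeholders, not real sensor readings.
import math

WHEEL_CIRCUMFERENCE_M = 0.35          # assumed wheel circumference in meters

def update_pose(x, y, revs_per_sec, heading_rad, dt):
    """Advance the (x, y) position estimate over one time step of length dt seconds."""
    speed = revs_per_sec * WHEEL_CIRCUMFERENCE_M      # forward speed from wheel odometry
    x += speed * math.cos(heading_rad) * dt           # heading comes from the IMU/compass
    y += speed * math.sin(heading_rad) * dt
    return x, y

# Simulated stream of (revolutions per second, heading in radians) samples at 10 Hz.
samples = [(5.0, 0.0), (5.0, 0.1), (4.5, 0.2), (4.5, 0.3)]
x, y = 0.0, 0.0
for revs, heading in samples:
    x, y = update_pose(x, y, revs, heading, dt=0.1)
print(f"estimated position: ({x:.2f} m, {y:.2f} m)")
```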

Malware and Graph Learning


Malware Detection

  • Group members: Yu-Chieh Chen, Ruoyu Liu, Yikai Hao

Materials:

Abstract: As technology has advanced rapidly in recent years, more and more people cannot live without cell phones, so it is important for cell phone companies and operating system providers to protect users’ data. Detecting malware based on the code it contains can prevent malware from being published and stop it at the source. This report aims to find a model that can detect malware accurately and with a small computational cost. It uses different matrices and graphs to capture the relationships between applications and detects malware based on their similarity. As a result, the best model achieves a test accuracy of around 99%.

Potential Improvement of MAMADROID System

  • Group members: Zihan Qin, Jian Jiao

Materials:

Abstract: Nowadays, the smartphone is an indispensable part of people's daily life, and Android is the most popular smartphone operating system. Due to this popularity, malware detection on Android has become one of the most significant tasks for the research community. In this project, we focus on one system called MAMADROID. Unlike previous work, which relied heavily on the permissions requested by apps, MAMADROID relies on the sequences of abstracted API calls performed by apps. We are interested in finding ways to improve this model. To achieve this, we have been trying to find new features to fit into the model. We built three basic models, took the one with the highest accuracy, and built two more advanced models based on the best-performing one.
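
To make the abstracted-API-call idea concrete, the toy sketch below builds the kind of Markov-chain transition features MAMADROID is based on, abstracting each call to its package and normalizing transition counts; the call sequence and abstraction level are hypothetical, not the actual MAMADROID pipeline.

```python
# Toy sketch of the Markov-chain feature idea behind MAMADROID:
# abstract each API call to its package, then compute transition probabilities.
from collections import defaultdict

# Hypothetical abstracted API-call sequence extracted from one app's call graph.
call_sequence = ["java.lang", "android.os", "android.os", "java.lang",
                 "android.telephony", "java.lang"]

states = sorted(set(call_sequence))
counts = defaultdict(lambda: defaultdict(int))
for src, dst in zip(call_sequence, call_sequence[1:]):
    counts[src][dst] += 1                       # count src -> dst transitions

# Normalize rows into transition probabilities; the flattened matrix is the feature vector.
features = []
for src in states:
    total = sum(counts[src].values())
    for dst in states:
        features.append(counts[src][dst] / total if total else 0.0)

print(dict(zip([f"{s}->{d}" for s in states for d in states], features)))
```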

Exploring the Language of Malware

  • Group members: Neel Shah, Mandy Ma

Materials:

Abstract: The Android app store and its open-source features make it extremely vulnerable to malicious software, known as malware. The current state of the art encompasses the use of advanced code analysis and corresponding machine learning models. In our initial research, we found that applications in the Android app store, along with their corresponding API calls, behave a lot like a language: they have their own comparable syntax, structure, and grammar. This inspired us to use techniques from Natural Language Processing (NLP) alongside the idea of creating graphical relationships between applications and APIs. We also show that the use of these graphical embeddings maintains the integrity of classification metrics and correctly identifies and differentiates malware and benign applications.

CoCoDroid: Detecting Malware By Building Common Graph Using Control Flow Graph

  • Group members: Edwin Huang, Sabrina Ho

Materials:

Abstract: In today's world, malware has grown tremendously. In 2020, there were more than 129 million Android users around the world. With Android applications dominating these devices, we hope to produce a detection tool that is accessible to the general public. We present a structure that analyzes apps in the form of control flow graphs. With that, we build a common graph to capture how close the apps are to each other and classify whether they are malicious. We compare our work with other methods and show that the control flow graph is a good choice of representation for Android applications (APKs) and can outperform other models. We built features using Metapath2Vec and Doc2Vec, and trained Random Forest, 1-Nearest Neighbors, and 3-Nearest Neighbors models.

Attacking the HinDroid Malware Detector

  • Group members: Ruben Gonzalez, Amar Joea

Materials:

Abstract: Over the past decade, malware has established itself as a constant issue for the Android operating system. In 2018, Symantec reported that it blocked more than 10 thousand malicious Android apps per day, while nearly three-quarters of Android devices remained on older versions of Android. With billions of active Android devices, millions of users are only a swipe away from becoming victims. Naturally, automated machine learning-based detection systems have become commonplace solutions, as they can drastically speed up the labeling process. However, it has been shown that many of these models are vulnerable to adversarial attacks, notably attacks that add redundant code to malware to confuse detectors. First, we introduce a new model that extends the HinDroid detection system by employing node embeddings using metapath2vec. We believe that the introduction of node embeddings will improve the performance of the model beyond the capabilities of HinDroid. Second, we intend to break these two models using a method similar to that proposed in the Android HIV paper. That is, we train an adversarial model that perturbs malware such that a detector mislabels it as a benign app. We then measure the performance of each model after recursively feeding adversarial examples back into it. We believe that by doing so, our model will be able to outperform the HinDroid implementation in its ability to label malware even after adversarial examples have been added.

Text Mining and NLP


AutoPhrase Application Web

  • Group members: Tiange Wan, Yicen Ma, Anant Gandhi

Materials:

Abstract: We propose the creation of a full-stack website as an extension of the AutoPhrase algorithm and text analysis to help non-technical users understand their text efficiently. We also provide users with a notebook demonstrating text analysis on one specific dataset.

Analyzing Movies Using Phrase Mining

  • Group members: Daniel Lee, Yuxuan Fan, Huilai Miao

Materials:

Abstract: Movies are a rich source of human culture from which we can derive insight. Previous work addresses either a textual analysis of movie plots or the use of phrase mining for natural language processing, but not both. Here, we propose a novel analysis of movies by extracting key phrases from movie plot summaries using AutoPhrase, a phrase mining framework. Using these phrases, we analyze movies through 1) an exploratory data analysis that examines the progression of human culture over time, 2) the development and interpretation of a classification model that predicts movie genre, and 3) the development and interpretation of a clustering model that clusters movies. We see that this application of phrase mining to movie plots provides a unique and valuable insight into human culture while remaining accessible to a general audience, e.g., history and anthropology non-experts.

AutoPhrase for Financial Documents Interpretation

  • Group members: Joey Hou, Shaoqing Yi, Zachary Ling

Materials:

Abstract: The stock market is one of the most popular markets in which investors like to put their money. Millions of investors participate in the stock market directly or indirectly, such as through mutual funds or defined-benefit plans. The performance of a stock price is highly related to the latest news, such as 8-K reports and annual or quarterly reports. These reports reflect the operating performance of the companies, which is an important fundamental driver of the stock price. However, a large volume of news reaches the market each day, and we want to build a model that extracts features from the news and uses them to predict the price trend. In this project, we apply the AutoPhrase model from Professor Jingbo Shang to extract high-quality phrases from news documents and to predict stock price trends. We aim to explore whether certain words or phrases correlate with higher or lower stock prices after the release of an 8-K report.

Text Classification with Named-Entity Recognition and AutoPhrase

  • Group members: Siyu Deng, Rachel Ung, Yang Li

Materials:

Abstract: Text Classification (TC) and Named-Entity Recognition (NER) are two fundamental tasks for many Natural Language Processing (NLP) applications, which involve understanding, extracting information, and categorizing the text. In order to achieve these goals, we utilized AutoPhrase and a pre-trained language NER model to extract quality phrases. Using these as part of our features, we are able to achieve very high performance for a five-class and a twenty-class text classification dataset. Our project will follow a similar setting as previous works with train, validation, and test datasets and comparing the results across different methods.

AutoLibrary - A Personal Digital Library to Find Related Works via Text Analyzer

  • Group members: Bingqi Zhou, Jiayi Fan, Yichun Ren

Materials:

Abstract: When encountering scientific papers, it is challenging for readers themselves to find other related works. First of all, it is hard to identify keywords that summarize the papers to search for similar papers. This dilemma is most common if readers are not familiar with the domains of papers that they are reading. Meanwhile, traditional recommendation models based on user profile and collection data are not applicable for recommending similar works. Some existing digital libraries’ recommender systems utilize phrase mining methods such as taxonomy construction and topic modeling, but such methods also fail to catch the specific topics of the paper. AutoLibrary is designed to address these difficulties, where users can input a scientific paper and get the most related papers. AutoLibrary solves the dilemma via a text analyzer method called AutoPhrase. AutoPhrase is a domain-independent phrase mining method developed by Jingbo Shang et al. (2018) that can automatically extract quality phrases from the input paper. After users upload the paper and select the fields of study of the paper, AutoLibrary utilizes AutoPhrase and our pre-trained domain datasets to return high-quality domain-specific keywords that could represent the paper. While AutoLibrary uses the top three keywords to search on Semantic Scholar for similar works at first, users could also customize the selection of the high-quality phrases or enter their own keywords to explore other related works. Based on the experiments and result analysis, AutoLibrary outperforms other similar text analyzer applications efficiently and effectively across different scientific fields. AutoLibrary is beneficial as it eases the pain point of manually extracting accurate, specific keywords from papers and provides a personalized user experience for finding related papers of various domains and subdomains.

Restaurant Recommender System

  • Group members: Shenghan Liu, Catherine Hou, Vincent Le

Materials:

Abstract: Over time, we rely more and more heavily on online platforms such as Netflix, Amazon, and Spotify, which have recommendation systems embedded in their applications. They learn users’ preferences by collecting their ratings, recording their clicks, and combing through their reviews, and then recommend more items. In building a recommender system, review text can hold the same importance as numerical statistics because it contains key phrases that characterize how the reviewer felt. For this project, we propose to build a recommender system with a primary focus on text review analysis through TF-IDF (term frequency-inverse document frequency) and AutoPhrase, and to add targeted segment-level analysis on phrases that attaches sentiments to aspects of a restaurant in order to rank recommendations. The ultimate goal is to design a website for deploying our recommender system and showing its functionality.
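
A minimal sketch of the TF-IDF similarity component is shown below, using plain review text and made-up restaurants; in the project the input would be AutoPhrase-extracted phrases and real review data, so treat this only as an illustration of the technique.

```python
# TF-IDF + cosine-similarity sketch of the review-text side of the recommender.
# The reviews below are made-up stand-ins; real input would be AutoPhrase-extracted phrases.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

restaurant_reviews = {
    "Taco Spot":   "great carne asada burrito, fast service, cheap lunch",
    "Noodle Bar":  "rich tonkotsu broth, long wait, cozy seating",
    "Burrito Hut": "huge california burrito, friendly staff, open late",
}

names = list(restaurant_reviews)
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(restaurant_reviews.values())   # one row per restaurant

# Recommend the restaurants whose aggregated review text is closest to a liked one.
liked = names.index("Taco Spot")
scores = cosine_similarity(matrix[liked], matrix).ravel()
ranked = sorted(zip(names, scores), key=lambda p: p[1], reverse=True)
print([name for name, _ in ranked if name != "Taco Spot"])   # most similar first
```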

Recommender Systems


ForumRec - A Question Recommender for the Super User Community

  • Group members: Yo Jeremijenko-Conley, Jack Lin, Jasraj Johl

Materials:

Abstract: The Super User forum exists on the internet as a medium for users to exchange information; the information shared there primarily relates to questions about operating systems. The system we developed, ForumRec, aims to increase usability for the forum’s participants by recommending questions that a particular user may be well suited to answer. The model we built uses a combination of content-based and collaborative filtering techniques from the LightFM package to identify how well a novel question would fit the desired user. In comparison to baseline models of how Super User already recommends questions, the model attains better performance on more recent data, scoring 0.0014, 0.0033, and 0.5160 on precision at 100, recall at 100, and AUC, which is markedly better than the baselines.
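
For readers unfamiliar with LightFM, the sketch below fits a small hybrid model on a synthetic interaction matrix and reports precision@k and AUC; the data, hyperparameters, and evaluation setup are placeholders, not ForumRec's actual configuration.

```python
# Minimal LightFM hybrid-recommender sketch on a tiny synthetic interaction matrix.
# The matrix and hyperparameters are placeholders, not ForumRec's actual setup.
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM
from lightfm.evaluation import precision_at_k, auc_score

rng = np.random.default_rng(0)
n_users, n_items = 50, 120
dense = (rng.random((n_users, n_items)) < 0.05).astype(np.int32)  # sparse implicit feedback
interactions = coo_matrix(dense)

model = LightFM(loss="warp", no_components=16, random_state=0)
model.fit(interactions, epochs=10, num_threads=1)

# Evaluating on the training interactions here just to keep the sketch self-contained.
print("precision@10:", precision_at_k(model, interactions, k=10).mean())
print("AUC:", auc_score(model, interactions).mean())
```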

OnSight: Outdoor Rock Climbing Recommendations

  • Group members: Brent Min, Brian Cheng, Eric Liu

Materials:

Abstract: Recommendations for outdoor rock climbing have historically been limited to word of mouth, guide books, and the most popular climbs. We aim to change that with our project OnSight, offering personalized recommendations for outdoor rock climbers.

Bridging the Gap: Solving Music Disputes with Recommendation Systems

  • Group members: Nayoung Park, Sarat Sreepathy, Duncan Carlmark

Materials:

Abstract: Many have probably found themselves in an uncomfortable conversation in which a parent is questioning why the song playing over a bedroom speaker is so loud, repetitive, or profane. If someone has never had such a conversation, at the very least they have probably made a conscious decision to refrain from playing a certain genre or artist when their parents are around. Knowing what music to play in these situations does not have to be an elaborate, stressful process. In fact, finding appropriate songs can be made quite simple with the help of recommendation systems. Our solution to this issue actually consists of two recommendation systems that function in similar ways. The first takes music that parents enjoy and recommends it to their children. The second takes music that children enjoy and recommends it to their parents. Both these recommendation systems create their own individual Spotify playlists that try to “bridge the gap” between the music tastes of parents and their children. Through user testing and user interviews, we found that our recommenders had mixed success in creating playlists that could be listened to by children and their parents. The success of our recommendations seemed to be largely correlated with the degree of inherent similarity between the music tastes of children and their parents. So while our solution is not perfect, in situations where overlap between parents and children exists, our recommender can successfully “bridge the gap”.

Asnapp - Workout Video Recommender

  • Group members: Peter Peng, Najeem Kanishka, Amanda Shu

Materials:

Abstract: For those who work out at home, finding a good workout routine is difficult. Many of the workout options you may find online are non-personalized and do not take into account your time and equipment constraints, as well as your workout preferences. Asnapp is a web application that provides personalized recommendations of workout videos by Fitness Blender, a company that provides free online workout videos. Our website displays several lists of recommendations (similar to Netflix’s user interface), such as “top upper body workouts for you”. Users can log in to our website, choose between several models to generate their recommendations, browse through personalized recommendation lists, and choose a workout to do, saving them the time and effort needed to build a good workout routine.

Plates4U

  • Group members: Zachary Nguyen, Alex Pham, Anthony Fong

Materials:

Abstract: Existing options for recipe recommendations are less than satisfactory. We sought to solve this problem by creating our own recommendation system hosted on a website. We used Food.com recipe data to create a classifier to identify cuisines of recipes, a popularity-based recommender, and a content-based filtering recommender using cosine similarity. In the future, we would like to improve upon this recommender by exploring alternative ways to model ingredients, tracking implicit/explicit data about users, and creating a hybrid recommender using collaborative techniques.

Makeup Recommender

  • Group members: Justin Lee, Shayal Singh, Alexandria Kim

Materials:

Abstract: Although product recommenders are conventional in the world of machine learning based recommender systems, cosmetics are an overlooked field. By providing a complete set of cosmetic recommendations, we can reduce the time and effort required for users to find the best products for a user’s personalized needs. Our goal is to create a recommender that will provide a one-stop shop experience where a user will get recommended an array of products to create an entire makeup look based on similar products that the user enjoys, products that similar users have purchased, as well as products that are personalized to the user including skin type, skin tone, ingredient preferences, and budget. The website recommends a complete makeup set personalized to the user. The user inputs their skin type, skin tone, budget, and any ingredient preferences so that we can suggest the best products for their personalized needs. The user also inputs a product of their choice from one of the four categories to aid with further personalization. Using this preference and knowledge about the user, we will suggest a complete set of products to complete a look. Our recommender provides four categories of products: face, cheeks, eyes, and lips. Our project aims to utilize collaborative filtering recommendations to ensure user satisfaction and success when creating their desired look.

Opioid Overdose Prevalence Analysis


Opioid Overdose Prevalence Analysis

  • Group members: Gunther Schwartz, Flory Huang, Hanbyul Ryu

Materials:

Abstract: Substance abuse is not only a significant health hazard of epidemic proportions, it is also a large marketplace in which addictive substances, abuse and co-abuse patterns, supply and demand patterns, and governmental regulation enforcement all play a role. The interplay of these factors changes significantly when exceptional events like the COVID-19 pandemic strike. The proposed capstone research will develop a knowledge-graph-based approach to compare pre-pandemic and in-pandemic dynamics. It will combine information integration, natural language processing, and machine learning techniques to automatically construct a knowledge graph by fusing information from governmental, news, and social media data.

Large-scale Multiple Testing


Multiple Testing Method with Empirical Null Distribution in Leukemia Studies

  • Group members: Raymond Wang

Materials:

Abstract: In genomics we are often faced with the task of identifying genes correlated with a specific disease among a large pool of candidate genes. A naive approach is to apply a hypothesis test to every individual gene. This method ignores confounding factors in the data and does not adjust for the additional variance. In this paper we introduce a much more robust method based primarily on estimates of the empirical null distribution and the false discovery rate (FDR). A leukemia dataset is used to demonstrate that the empirical null distribution, one estimated by observing the data first, provides a better fit than the theoretical null distribution. Furthermore, we compare and contrast the results with unsupervised classification methods such as k-Means and the Gaussian Mixture Model.
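
A compact sketch of the empirical-null idea on simulated z-scores is shown below: the null mean and scale are estimated from the central bulk of the scores, and Benjamini-Hochberg FDR control is then applied. The simulated data and the crude IQR-based estimator are illustrative assumptions, not the paper's method or the leukemia data.

```python
# Empirical-null + FDR sketch on simulated z-scores (not the leukemia data).
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
# 95% null genes whose z-scores are shifted/widened (mimicking confounding),
# plus 5% truly associated genes with larger effects.
z = np.concatenate([rng.normal(0.3, 1.3, 1900), rng.normal(3.5, 1.0, 100)])

# Crude empirical-null estimate: use the central 50% of z-scores,
# where true signals are assumed to be rare.
lo, hi = np.percentile(z, [25, 75])
central = z[(z >= lo) & (z <= hi)]
null_mean = central.mean()
null_sd = (hi - lo) / (2 * stats.norm.ppf(0.75))   # IQR-based robust scale estimate

# Two-sided p-values under the empirical null instead of N(0, 1).
p_emp = 2 * stats.norm.sf(np.abs(z - null_mean) / null_sd)
reject, _, _, _ = multipletests(p_emp, alpha=0.1, method="fdr_bh")
print("genes flagged at FDR 0.1:", int(reject.sum()))
```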

Large-scale Multiple Testing with Empirical Null Distribution in Predicting Cardiovascular Disease

  • Group members: Zimin Dai, Leyang Zhang, Wentao Chen

Materials:

Abstract: According to the World Health Organization, cardiovascular disease, such as ischemic heart disease and stroke, is the leading cause of death globally. We use features and health conditions of a person to determine the signals and the probability of whether he/she has cardiovascular disease. To achieve this goal, we implemented logistic regression and large-scale multiple testing methods on a dataset with ample information. The empirical null is applied to find outliers and filter the dataset; we successfully removed about 10,000 outliers from the 70,000 observations by applying the FDR and FPR methods to find the upper and lower bounds. Finally, our product is a logistic regression model that predicts whether a person has cardiovascular disease with an accuracy score of 0.7220 on the test set.

Spatial-temporal Analyses of Infectious Disease Dynamics


Spatial-temporal Prediction of COVID-19 Case Counts Through Epidemiology Model

  • Group members: Shuyuan Wang

Materials:

Abstract: An epidemiology model alone is not sufficient to account for the complexity of COVID-19 and thus, when implemented alone, often gives very inaccurate predictions. One reason is that the model itself does not take into consideration the spatial aspect of the region: it assumes the region is isolated from all other regions. In reality, there is a lot of traffic going back and forth between neighboring regions at the boundaries. In this project, my team adjusts the epidemiology model to account for the spatial aspect of the disease in order to predict case counts for Californian counties. We include adjacency, the distance between counties, and mobility. After fitting the original epidemiology model through a gradient approach, with learning rates adjusted by the Hessian matrix, to find the infection duration and infection rate for each county, we incorporated geographical information, including adjacency and distance among the Californian counties. The spatial model also takes into account the mobility score of each county, that is, how fast people are moving around. To test the model, we performed predictions for 3/2/2021 based on the previous day's (3/1/2021) case counts with dt set to 5. The infection duration and infection rate are based on the previous 40 days. Most of the counties yield less than 1% error. However, 5 counties have inaccurate predictions because of missing data for their neighbors, due either to being on the edge of California or to low population. The model has not yet been tested on the entire United States due to a lack of computing speed and missing data, which would result in many counties without any neighbors. In the future, the model can be extended to predict 3 or more days in advance to generate more value. Furthermore, the infection duration and infection rate are still derived from the original model; in the future, we wish to use gradient descent to acquire the infection duration and infection rate dynamically from the new model.
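
As a toy illustration of the spatially coupled compartmental idea, the sketch below runs discrete-time SIR steps for two coupled regions, mixing local and neighboring infections through a mobility matrix; all parameters and populations are made-up placeholders, not the fitted model described above.

```python
# Toy discrete-time SIR step for two coupled "counties" (illustrative placeholders only).
import numpy as np

def sir_step(S, I, R, beta, gamma, mobility, dt):
    """One Euler step of SIR dynamics with a simple cross-region infection term."""
    N = S + I + R
    I_eff = mobility @ I                   # infectious pressure felt from self and neighbors
    new_inf = beta * S * I_eff / N * dt
    new_rec = gamma * I * dt
    return S - new_inf, I + new_inf - new_rec, R + new_rec

S = np.array([99000.0, 49500.0])           # susceptible in county A and B
I = np.array([1000.0, 500.0])              # currently infected
R = np.array([0.0, 0.0])                   # recovered
beta, gamma = 0.25, 1 / 10                 # infection rate and 10-day infectious period
mobility = np.array([[0.9, 0.1],           # rows: how much each county "feels" its neighbor
                     [0.2, 0.8]])

for _ in range(5):                         # predict five days ahead, one day per step
    S, I, R = sir_step(S, I, R, beta, gamma, mobility, dt=1.0)
print("predicted infected counts:", np.round(I))
```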

Graph Data Analysis


A Graph ML Analysis of Senatorial Twitter Accounts

  • Group members: Yueting Wu, Yimei Zhao, Anurag Pamuru

Materials:

Abstract: This project’s main inquiry is into whether there is a tangible difference in the way that Democratic members of Congress speak and interact on social media in comparison to Republican members of Congress. If there are such differences, this project will leverage them to train a suitable ML model on this data for node classification. That is to say, this project aims to determine a Senator’s political affiliation based on a) their Twitter relationships to other Senators, b) their speech patterns, and c) other mineable features on Twitter. In order to truly utilize the complex implicit relationships hidden in the Twitter graph, we can use models such as Graph Convolutional Networks, which apply the concept of “convolutions” from CNNs to a graph network-oriented framework. These GCNs learn feature representations for each node in the Twitter graph and utilize those representations to fuel the aforementioned node classification task. However useful the GCN may be, there is no shortage of other graph ML techniques that could lend themselves to the prediction task at hand. Of particular interest are inductive graph ML techniques; inductive graph networks are a newer class of graph networks that no longer need to be trained on a whole graph to get feature representations for all nodes in the dataset (transductive). Instead, inductive techniques like GraphSage peer into the structural composition of all the nodes in a graph by building neighborhood embeddings for each node. By using a medley of networks on this dataset, we gain deeper insight into what kind of graph we are working with. In other words, if more complex techniques like GraphSage outrank vanilla GCNs, it would point to an equally complex structural composition within the graph that only an inductive technique like GraphSage would be able to pinpoint. However, it is harder to train any network without features. In the case of our analysis, these features will be some text embedding of a politician's tweets. Solutions like word2vec, or even a sentiment analysis metric aggregated across the hundreds of thousands of tweets posted by the 116th Congress, could prove quite useful as features for training the aforementioned models.
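
For reference, a minimal two-layer GCN for node classification (here with PyTorch Geometric on a small random graph) looks roughly like the sketch below; the graph, features, labels, and hyperparameters are synthetic stand-ins, not the senatorial Twitter data or the project's models.

```python
# Two-layer GCN for node classification (PyTorch Geometric), on a tiny random graph.
# The graph, features, and labels are synthetic stand-ins for the senatorial Twitter data.
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

num_nodes, num_feats, num_classes = 100, 32, 2
x = torch.randn(num_nodes, num_feats)                         # e.g. tweet-text embeddings
edge_index = torch.randint(0, num_nodes, (2, 400))            # random directed edges
y = torch.randint(0, num_classes, (num_nodes,))               # party labels (placeholder)
data = Data(x=x, edge_index=edge_index, y=y)

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(num_feats, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, data):
        h = F.relu(self.conv1(data.x, data.edge_index))
        return self.conv2(h, data.edge_index)

model = GCN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(50):                                       # short training loop
    optimizer.zero_grad()
    out = model(data)
    loss = F.cross_entropy(out, data.y)
    loss.backward()
    optimizer.step()
print("final training loss:", float(loss))
```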

GCN on 3d Points

  • Group members: Shang Li, Xinrui Zhan

Materials:

Abstract: This research focuses on 3D shape classification. Our goal is to predict the category of shapes consisting of 3D data points. We implement Graph Neural Network models and compare their performance with PointNet, a popular architecture for 3D point cloud classification tasks. Not only do we compare standard metrics such as accuracy and the confusion matrix, we also explore the models' resilience to data transformations. In addition, we tried combining PointNet with graph pooling layers. Our experiments show that even though PointNet has higher accuracy overall, the GCN makes much more reasonable misclassifications and is much more robust to data augmentation.

Graph-Based Product Recommendation

  • Group members: Abdullatif Jarkas, Nathan Tsai

Materials:

Abstract: Recommender systems are important, revenue-generating technologies in many of the services today, providing recommendations for social, product, and other networks. However, the majority of existing recommender system methods use metrics of similarity to recommend other nodes through content-based and collaborative filtering approaches, which do not take into account the graph structure of the relationships between the nodes. A graph-based recommender system then is able to utilize graph relationships to improve node embeddings for recommendation in a way that conventional recommender systems cannot. Inspired by PinSage, we explore an unsupervised graph-based recommendation method that can take advantage of the relationships between nodes, in addition to the text and image features, and generate more accurate and robust embeddings for Amazon product recommendation.

NBA Game Prediction

  • Group members: Austin Le, Mengyuan Shi

Materials:

Abstract: When working with an NBA dataset, we wanted to figure out the best way to represent a network-like structure among the teams and figured that the amount of time that each player spends on the court with one another would prove useful. By extracting this network and projecting player statistics onto each node, we utilize GraphSage, a framework that embeds node features for each player, and aggregate each team to predict whether or not they can make the playoffs.

NBA Seeds with Graph Neural Networks

  • Group members: Steven Liu, Aurelio Barrios

Materials:

Abstract: The NBA presents many challenges when attempting to make predictions. The performance of a team in the NBA is difficult to predict because many things can happen over the course of 81 games. Our analysis attempts to produce accurate results by exploiting the natural structure of the NBA league and data on previous player stats. Our analysis begins by identifying the players on each roster to create an aggregated stat for each team; then we take advantage of each team's schedule to learn the unique performance of a team against every other team. Leveraging the features and the schedules of the teams, we expect to be able to make decent predictions of NBA seedings before a season starts.

Stock Market Sentiment Predictor

  • Group members: Jason Chau, Sung-Lin Chang, Dylan Loe

Materials:

Abstract: In this project, we aim to produce a tool that will be able to predict the stock movement of a company. The output will be binary, indicating whether we are bullish or bearish on a stock. In our pursuit of making this tool, we incorporate graph convolutional networks to take advantage of the interconnected features of stocks.

The Spread of Misinformation


Political Popularity of Misinformation

  • Group members: Catherine Tao, Aaron Chan, Matthew Sao

Materials:

Abstract: For our research on the political popularity of misinformation, we want to study the influence politicians have on Twitter, a well-known social media platform for users to voice their opinions to a wider audience. The information shared on Twitter that we are interested in is grouped into scientific information or misinformation. Politicians can easily sway public opinion with a simple tweet, so we wanted to analyze how much they influence other Twitter users. We gathered ten politicians whom we considered to spread scientific information on Twitter and ten politicians whom we considered to spread misinformation on Twitter. We analyze the two groups to show how controversial a tweet appears. We do this by looking at tweet engagement as well as popularity metrics to see growth over time. The results of our investigation show that politicians who spread misinformation have a higher ratio value on average and fewer overall likes across their tweets. Our permutation tests show that our scientific group has been consistently growing and increasing in growth over time. In contrast, our misinformation group has grown significantly, but only in more recent years. Overall, our results show that a politician can experience the most growth by spreading non-controversial, scientific information.
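
A generic version of the permutation test used for group comparisons like the one above is sketched below on synthetic engagement counts; the data, statistic, and group sizes are assumptions for illustration only.

```python
# Generic permutation test for a difference in mean engagement between two groups.
# The engagement values below are synthetic, not the collected tweet data.
import numpy as np

rng = np.random.default_rng(42)
science_group = rng.poisson(lam=120, size=200)      # e.g. likes per tweet, group A
misinfo_group = rng.poisson(lam=100, size=200)      # e.g. likes per tweet, group B

observed = science_group.mean() - misinfo_group.mean()
pooled = np.concatenate([science_group, misinfo_group])

n_perm, count = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[:len(science_group)].mean() - pooled[len(science_group):].mean()
    if abs(diff) >= abs(observed):                  # two-sided test
        count += 1
print(f"observed difference: {observed:.2f}, permutation p-value: {count / n_perm:.4f}")
```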

The Sentiment of U.S. Presidential Elections on Twitter

  • Group members: Zahra Masood, Sravya Voleti, Hannah Peterson

Materials:

Abstract: Political tensions in the United States came to a head in 2020 as the public responded to various major events such as the onset of the COVID-19 pandemic and the murder of George Floyd, as well as the 2020 presidential election. Here we investigate if there is evidence of increasing polarization and negativity in regards to politics among the American public on social media by analyzing Twitter data related to the 2016 and 2020 presidential elections. Using publicly available datasets of tweets for each election, we perform sentiment analysis on the text of tweets to quantify their degrees of negativity and subjectivity. We also identify political leanings of tweets by analyzing their hashtag usage and identify “dialogue” occurring between and amongst left- and right-leaning users by analyzing the tweets’ user mentions. We then conduct permutation testing on these various groupings of tweets between the two years to determine if there is statistical evidence of increased polarization and negativity on social media surrounding the U.S. presidential election from 2016 to 2020, both generally and between and within political parties. We find that election-related tweets in 2020 generally used less neutral language than in 2016 but were not conclusively more positive or negative in sentiment.

Community Effects From Misinformation Flags on Twitter

  • Group members: Nigel Doering, Tanuj Pankaj, Raechel Walker

Materials:

Abstract: Recent events including the 2016 election, the COVID-19 pandemic, the 2020 election, and the development of a COVID-19 vaccine have laid bare the essential need to prevent misinformation from spreading uncontrollably on social networks. Social media companies have developed systems for preventing the further spread of misinformation. Most notably, some companies have begun placing flags that warn a user about the misinformative content of a post. Prior research has addressed ways to analyze Twitter users on how conservative versus liberal, moderate versus extreme, and pro-science versus anti-science they are based on their tweet history. We detail a novel machine learning approach to classify users along three similar dimensions. We then conduct an analysis comparing Twitter users who retweeted flagged tweets versus those who retweeted unflagged tweets, with the tweets coming from high-profile conservative Twitter users, such as Eric Trump. Results from the analysis suggest that users who share these flagged tweets tend to be slightly more liberal and more moderate than users who share unflagged tweets. We propose possible explanations, as well as future work to better understand the impact of misinformation flags.

Political Polarization of Major News Networks on Twitter

  • Group members: Christopher Ly, Shutong Li, Mark Chang

Materials:

Abstract: We construct a geometric definition of the political spectrum of major US news outlets through an unsupervised approach. We model the political alignments of the outlets in terms of pairwise political similarity between pairs of outlets using graphs and embed the graph into a Euclidean space. We collect hashtags used in the users' own timelines and cross-reference them with the hashtags used during the election period to classify their political stance, as well as to create a graph analysis of the news networks as a whole. Through this, we demonstrate where each news network lies on the U.S. political spectrum and how the networks lie relative to one another in hashtag vector space.

Twitter’s Impact on Elections

  • Group members: Prem Pathuri, Zhi Lin

Materials:

Abstract: The rise of social media has come to dominate every aspect of our daily lives. At the height of the 2020 presidential election, and as COVID-19 rampaged throughout the world, it facilitated increased online discussion, as well as the spread of information and misinformation. This project investigates the relationship that discussion on social media has with election outcomes. It finds that, in comparing two distinct presidential elections, both of which took place as Twitter usage grew steadily, increased discussion levels were present in a Democratic win of the election.

Analyzing the Diffusions of Various Forms of Misinformation on Reddit

  • Group members: Cindy Huynh, Hasan Liou, Helen Chung

Materials:

Abstract: Misinformation has taken social media by storm. It reaches every corner of these platforms, touching topics from the existence of aliens to contesting the outcomes of a presidential election. The consequences of such viral misleading content are disruptive, and we are just beginning to see how devastating these effects can be in real time. We look into the diffusion of misinformation on Reddit, specifically how users within specific subreddits behave and interact with one another. We look at three categories of subreddits that we selected: ones regarding scientific information, political misinformation, and urban myth misinformation. We analyze these three categories, examining how they intersect with each other, or the lack thereof. We utilize user polarities, defined as how “loyal” a user is to one category of subreddits compared to the other two categories, on a scale of 0 to 1. We conclude that echo chambers exist in the categories we looked at, and that the users within these respective categories behave differently from one another.

COVID-19 Sentiment and Daily Cases Analysis on Social Media

  • Group members: Jiawei Zheng, Yunlin Tang, Zhou Li

Materials:

Abstract: With the unexpected impact of COVID-19, drastic changes were induced in people’s health, lifestyle, and mentality. During our research last quarter, we noticed that the majority of posts in our Twitter dataset carry strong emotions and sentiments. In this project, we trained an SVC tweet sentiment model using a Kaggle dataset that contains 1.6 million records with text and sentiment labels. The trained model is used to predict sentiment scores on the daily tweets sampled from the Panacea Lab dataset. After that, we detrended the daily case data and performed multiple analyses, including correlation, a cointegration test, and Fourier transformation, to study its relationship with the sentiment score.
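
A minimal sketch of a TF-IDF plus linear SVC sentiment classifier of the kind described is shown below, with a tiny inline dataset standing in for the 1.6-million-tweet Kaggle corpus; the features and model settings are assumptions, not the project's exact configuration.

```python
# TF-IDF + linear SVC sentiment-classifier sketch; the tiny labeled set below
# stands in for the 1.6-million-tweet Kaggle training corpus.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "i love this, feeling great today",
    "what a wonderful morning",
    "this is terrible, i am so upset",
    "worst day ever, everything went wrong",
]
train_labels = [1, 1, 0, 0]           # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(train_texts, train_labels)

daily_tweets = ["vaccines finally arriving, so hopeful", "another lockdown, this is awful"]
print(clf.predict(daily_tweets))      # predicted sentiment labels for sampled tweets
```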

Conflict and Collaboration in Online Communities


Controversy In Wikipedia Articles

  • Group members: Hengyu Liu, Xiangchen Zhao, Xingyu Jiang

Materials:

Abstract: There are “wars” going on every day online, but instead of cities, people are defending their opinions and perspectives. This phenomenon is especially common on the Wikipedia platform, where users are free to edit others' revisions. In fact, “about 12% of discussions are devoted to reverts and vandalism, suggesting that the WP development process is highly contentious.” (Robert 1) As Wikipedia has become a trusted, freely accessible source of information and knowledge, it is important to investigate how editors collaborate with and controvert each other on such a platform. This paper discusses a new method of measuring controversiality in Wikipedia articles. We have found that controversiality is highly related to the number of revert edits, the sentiment level among an article's comments, and the view counts of that article. Thus we developed a weighted-sum formula, which combines those three factors to accurately measure the controversy level within articles in Wikipedia.
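
One possible reading of the weighted-sum idea is sketched below, combining the three factors named above (revert edits, comment sentiment, and view counts) with illustrative weights and normalizations; the weights, saturation constants, and example values are hypothetical, not the paper's fitted formula.

```python
# Hypothetical weighted-sum controversiality score combining the three factors named
# above; the weights and normalization are illustrative, not the paper's fitted values.
def controversy_score(revert_count, mean_comment_sentiment, view_count,
                      w_reverts=0.5, w_sentiment=0.3, w_views=0.2):
    """Higher score = more controversial. mean_comment_sentiment is assumed in [-1, 1]."""
    revert_term = revert_count / (revert_count + 10)        # saturating revert signal
    negativity_term = (1 - mean_comment_sentiment) / 2      # maps +1 -> 0 and -1 -> 1
    view_term = view_count / (view_count + 100_000)         # saturating popularity signal
    return w_reverts * revert_term + w_sentiment * negativity_term + w_views * view_term

# Example: a heavily reverted, negatively discussed, widely viewed article.
print(round(controversy_score(revert_count=250, mean_comment_sentiment=-0.4,
                              view_count=1_200_000), 3))
```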

The Large-Scale Collaborative Presence of Online Fandoms

  • Group members: Casey Duong, Kylee Peng, Darren Liu

Materials:

Abstract: Fan communities exist within every industry, and there has been little study on understanding their scale and how they influence the media and their industries. As technology and social media have made it easier than ever for fans to connect with their favorite influencers and find like-minded fans, we’ve seen a rise in fan culture or “fandom”. These individuals form fan groups and communities, which have become increasingly popular online and have rallied behind their favorite artists for different causes. In recent years, K-pop has taken the music industry by storm, quickly rising to global significance and gathering some of the most dedicated fanbases in the world. We explore the similarities and differences in collaboration efforts among fans of three popular artists, BTS, Taylor Swift, and Justin Bieber on two primary online social platforms, Twitter and Wikipedia. We present a new method to quantify the strength and influence of online fan communities—with a focus on the BTS fanbase—and how this online collaboration affects outside audiences.

Wikipedia’s Response to the COVID-19 Pandemic

  • Group members: Michael Lam, Gabrielle Avila, Yiheng Ye

Materials:

Abstract: Through collaborative efforts online, Wikipedia has always been at the forefront of providing information to the public on almost any topic, including a pandemic. COVID-19 was one of the most relevant topics of 2020 and remains so today; gathering as much information as possible is therefore essential for the world to combat such a virus. Many official health sources online provide such knowledge with the resources that they have, but false or outdated information can spread quickly. In this article, we perform EDA and LDA on different Wikipedia articles related to the coronavirus and compare the results to word clouds from traditional sources to explore how Wikipedia can provide reliable and up-to-date details and data about COVID-19.

A Study of LGBTQ+ Wikipedia Articles Sentiment over Time

  • Group members: Henry Lozada, Emma Logomasini, Parth Patel, Yuanbo Shi

Materials:

Abstract: We detail a specific method that determines how, if at all, sentiment changes over time for a category of Wikipedia articles, which, in our study, are articles categorized by Wikipedia as LGBT articles. This method uses three different sentiment analyzers, one for each of the three different language editions of Wikipedia we are analyzing, to calculate the sentiment of a Wikipedia article, doing so for all edits in the article's revision history and for all articles in each language's LGBT category. This enables us to calculate a fixed effects regression for each language's sentiment scores, allowing us to determine whether or not time has a positive effect on the articles' sentiment scores, as well as to compare these trends across languages.
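A hedged sketch of the fixed effects regression described above, using article fixed effects so the time coefficient reflects within-article sentiment trends. The file name and column names are assumptions for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical frame: one row per (article, revision) with the revision's
# sentiment score and the years elapsed since the article was created.
revisions = pd.read_csv("lgbt_article_revisions.csv")  # columns: article, years_elapsed, sentiment

# Article fixed effects absorb each article's baseline sentiment, so the
# coefficient on years_elapsed estimates the within-article trend over time.
model = smf.ols("sentiment ~ years_elapsed + C(article)", data=revisions).fit()
print(model.params["years_elapsed"], model.pvalues["years_elapsed"])
```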

Politics on Wikipedia

  • Group members: Cameron Thomas, Iakov Vasilyev, Joseph Del-Val

Materials:

Abstract: This paper analyzes the degree and prevalence of political bias and controversy on Wikipedia. Using pre-trained models from Rheault and Cochrane (2019) and Shapiro and Gentzkow (2019), we validate our methods for generalizability on the Ideological Books Corpus (Sim et al., 2013) with sub-sentential annotations (Iyyer et al., 2014) and then apply these methods to gain insight into political bias on Wikipedia. We also attempt to account for overlap between political slants and to avoid labeling as biased those articles whose political signal is unavoidable given their topic. We hope the resulting insight into political bias on Wikipedia will prove useful in combating counterproductive activity and will allow for more precise and targeted work by Wikipedia monitors.

Genetic Basis of Mental Health


Differential Gene Expression Analysis of Human Opioid Abusers

  • Group members: Cathleen Pena, Dennis Wu, Zhaoyi Guo

Materials:

Abstract: Opioid abuse is a serious national crisis. The opioid epidemic is unique and important because 21-29% of patients who are prescribed opioids end up misusing them [16]. Opioids increase the amount of dopamine made in the brain by acting on dopamine-synthesizing neurons, making them very easy to become addicted to. This study explores the long-lasting changes in gene expression that may contribute to addiction, cravings, and relapse by studying subjects who continuously used opioids. Using DESeq2 [7] and WGCNA [8], our analyses identified differentially expressed genes by finding which genes were up-regulated and down-regulated, and we found distinct gene networks associated with opioid abuse. Overall, 28 genes were found to be down-regulated and 16 genes were up-regulated. The opioid-regulated genes identified in our project could serve as new therapeutic targets to help combat opioid addiction.

The Genetic Basis of Antibiotic Resistance in E. Coli

  • Group members: Jiayi Wu, Alan Chen, Myra Haider

Materials:

Abstract: One of the greatest challenges in public health is the growing number of bacterial species that have developed resistance to antibiotics through point mutations in the genome. Our project aims to identify these genetic markers of antibiotic resistance through a genome-wide association study between 36 antibiotic-resistant E. coli samples and 36 controls. Variants were identified in both groups, checked for statistical significance, and analyzed for any functional effects.
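
A hedged sketch of testing a single variant for association between resistant samples and controls, using Fisher's exact test on presence/absence counts. The counts below are made up for illustration and are not from the project's data.

```python
from scipy.stats import fisher_exact

# Hypothetical counts at one genomic site: rows = (resistant, control),
# columns = (carries variant, carries reference).  With 36 samples per group,
# each row sums to 36.
table = [[28, 8],    # resistant samples: variant vs. reference
         [9, 27]]    # control samples:   variant vs. reference

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"OR = {odds_ratio:.2f}, p = {p_value:.2e}")
# In a genome-wide scan, this p-value would be compared against a
# multiple-testing-corrected threshold (e.g. Bonferroni over all tested sites).
```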

Blood-based Analysis of Alzheimer's Disease from miRNA Data

  • Group members: Gregory Thein, Justin Kang, Ryan Cummings

Materials:

Abstract: Alzheimer’s Disease (AD) is an irreversible, progressive neurodegenerative disorder that slowly destroys a person's cognitive and physical abilities. The cause of AD is unclear, but it is believed to be a combination of genetic, environmental, and lifestyle factors. Because the only way to definitively diagnose AD is post mortem, the search for earlier definitive detection is crucial. One approach is to analyze blood samples for biomarkers and microRNAs. A biomarker is defined as a characteristic that is objectively measured as an indicator of normal biological processes, while microRNAs (miRNAs) are non-coding RNA molecules involved in the regulation of gene expression. Recent studies point to miRNAs and biomarkers as possible tools for AD diagnosis, leading us to analyze blood miRNA data for our study. Drawing on several prior studies, we examined 70 blood samples from AD and control patients through our custom genetics pipeline in hopes of better understanding the pathology of the disease. We then applied two statistical tests: a non-parametric hypothesis test (the Wilcoxon-Mann-Whitney test) and a parametric test (DESeq2). From these tests we isolated nine significant samples for further analysis of their relationship to and effect on AD.
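
A hedged sketch of the non-parametric test mentioned above applied per miRNA across the two groups, followed by a Benjamini-Hochberg correction (the correction method, file names, and column layout are my assumptions for illustration, not the project's pipeline).

```python
import pandas as pd
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Hypothetical normalized count matrix: rows = miRNAs, columns = samples.
counts = pd.read_csv("mirna_counts.csv", index_col=0)                   # placeholder file
is_ad = pd.read_csv("labels.csv", index_col=0)["is_ad"].astype(bool)    # placeholder labels, one per sample

p_values = []
for mirna, row in counts.iterrows():
    stat, p = mannwhitneyu(row[is_ad.values], row[~is_ad.values], alternative="two-sided")
    p_values.append(p)

# Adjust for testing many miRNAs at once (Benjamini-Hochberg FDR).
reject, q_values, _, _ = multipletests(p_values, method="fdr_bh")
significant = counts.index[reject]
print(f"{len(significant)} miRNAs significant after FDR correction")
```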

Comparison of Differential Gene Expression Analysis Tools

  • Group members: Brandon Tsui, Joseph Bui, Weijie Cheng

Materials:

Abstract: RNA-Seq (short for "RNA sequencing") is a sequencing technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, capturing the continuously changing cellular transcriptome. Differential expression analysis takes the normalized read count data and performs statistical analysis to discover quantitative changes in expression levels between experimental groups. As technology progresses, many tools are now available that can perform such differential expression analysis. The purpose of our project is to take a closer look at some of these tools and compare their performance to understand which are optimal to use. Specifically, the software packages we focus on are ABSSeq, voom.limma, PoissonSeq, DESeq2, NOISeq, ttest, and edgeR. We compare their performance on metrics such as Area Under the Curve (AUC), False Discovery Rate (FDR), Type I error rate, and sensitivity.
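
A hedged sketch of how the evaluation metrics named above can be computed for one tool's calls against a known ground truth (e.g. from simulated data). The toy arrays below are placeholders, not results from any of the tools compared.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Ground truth (1 = truly differentially expressed) and one tool's output:
# a p-value per gene, called significant at a fixed threshold.
truth = np.array([1, 1, 0, 0, 1, 0, 0, 0, 1, 0])
p_vals = np.array([0.001, 0.04, 0.20, 0.03, 0.002, 0.60, 0.45, 0.07, 0.01, 0.90])
called = p_vals < 0.05

tp = np.sum(called & (truth == 1))
fp = np.sum(called & (truth == 0))
fn = np.sum(~called & (truth == 1))

sensitivity = tp / (tp + fn)            # true positive rate
fdr = fp / max(tp + fp, 1)              # share of calls that are false
type_i = fp / np.sum(truth == 0)        # false positive rate among true nulls
auc = roc_auc_score(truth, 1 - p_vals)  # rank genes by strength of evidence
print(sensitivity, fdr, type_i, auc)
```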

Genetic Overlap between Alzheimer's, Parkinson’s, and healthy patients

  • Group members: Xuanyu Wu, Justin Lu, Saroop Samra

Materials:

Abstract: Our research compares overlapping patterns in miRNA between patients with Alzheimer's and Parkinson’s across two biofluids, cerebrospinal fluid and serum, by pinpointing significant transcription factors that the diseases share. We hope the results of our gene analysis can be leveraged by researchers to help alleviate the effects of the disorders and potentially develop medicines and therapies that target these genes.

VPN X-Ray


Live vs. Video on Demand Inside VPN Detection

  • Group members: Da Gong, Mariam Qader, Andrey Pristinsky, Tianran Qiu, Zishun Jin

Materials:

Abstract: Due to the variety, affordability, and convenience of online video streaming, there are more subscribers than ever to video streaming platforms. The decreased operation of non-essential businesses and the increase in the number of people working from home this past year have further compounded this effect. More people are streaming live lectures, sports, news, and video calls at home today than ever before. In March 2020, Youtube saw a 2.5x increase in the amount of time people spent streaming live video [1], and Twitch more than doubled its hours of content in the three months after the start of the pandemic [1]. There is a huge boom in the video content world, and it does not seem to be slowing down anytime soon. Internet Service Providers, such as Viasat, are tasked with optimizing internet connections and tailoring their allocation of resources to fit each customer's needs. With this increase in internet activity, it would be especially beneficial for Viasat to understand what issues arise when customers stream various forms of video. When a user has difficulties with their internet connection, ISPs want to be able to understand their activity in order to explain why the problem occurred and offer a quick solution.

DANE: Data Automation and Network Emulation Tool

  • Group members: Danial Yaseen, Sahil Altekar, Parker Addison

Materials:

Abstract: In the field of network traffic research, datasets are often manually generated with in-house methods using fast internet connections. This creates a data representation issue, as we can’t expect all internet users to have great network conditions. How can we make sure network research is taking diverse network conditions into account? Is there a better way to generate traffic datasets with representative network conditions? DANE is a hackable and automated dataset generation tool which collects traffic data in a variety of configurable network environments. In our talk we introduce the tool, the purpose it serves, and how it works. Finally, we dive into an example of real-world analysis using data collected by our tool.

Res Recovery: Classifying Video Resolutions Through a VPN Tunnel

  • Group members: Samson Qian, Shrimant Singh, Soon Shin, Iman Nematollahi, Stephen Doan

Materials:

Abstract: Virtual private networks, or VPNs, have seen a growth in popularity as more of the general population has come to realize the importance of maintaining data privacy and security while browsing the Internet. In previous works, our domain developed robust classifiers that could identify when a user was streaming video. As an extension, our group has developed a Random Forest model that determines the resolution at the time of video streaming.
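
A hedged sketch of training a Random Forest on per-interval traffic features to predict streaming resolution, as described above. The file name, feature names, and resolution labels are placeholders for the kind of byte/packet statistics such a model might use, not the project's actual feature set.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical per-interval features extracted from VPN traffic captures.
flows = pd.read_csv("vpn_intervals.csv")  # placeholder file
features = ["bytes_down", "bytes_up", "packets_down", "peak_throughput", "burst_count"]
X, y = flows[features], flows["resolution"]   # e.g. "480p", "720p", "1080p"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```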

SPICIVPN

  • Group members: Arely Vasquez, Chang Yuan, Raimundo Castro, Jerry Qian, Molly Rowland

Materials:

Abstract: Whether to access another country's Netflix library or for privacy, more people are using Virtual Private Networks (VPNs) to stream videos than ever before. However, the different service providers offer different user experiences that can lead to differences in their network transmissions. In this project we discuss how we built a classification model to determine which streaming service provider is being used over a VPN. The streaming providers the model identifies are Amazon Prime, Youtube, Netflix, Youtube Live, and Twitch. This is valuable for understanding the differences in network traffic patterns across streaming service providers. Across all providers, our Random Forest model achieves 96.5% accuracy in provider classification.

Particle Physics


Interpreting Higgs Boson Interaction Network with Layerwise Relevance Propagation

  • Group members: Alex Luo, Cecilia Xiao

Materials:

Abstract: While graph interaction networks achieve exceptional results in Higgs boson identification, GNN explainer methodology is still in its infancy. To introduce GNN interpretation to the particle physics domain, we apply layerwise relevance propagation (LRP) to our existing Higgs boson interaction network (HIN) to calculate relevance scores and reveal what features, nodes, and connections are most influential in prediction. We call this application HIN-LRP. The synergy between the LRP interpretation and the inherent structure of the HIN is such that HIN-LRP is able to illuminate which particles and particle features in a given jet are most significant in Higgs boson identification. The resulting interpretations are ultimately congruent with extant particle physics theory, with the model demonstrably learning the importance of concepts like the presence of muons, characteristics of secondary decay, and salient features such as impact parameter and momentum.
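
The project's HIN-LRP implementation is not shown here; as a generic illustration of how layerwise relevance propagation redistributes relevance, the sketch below applies the standard LRP epsilon rule to a single dense layer. All values are toy numbers.

```python
import numpy as np

def lrp_epsilon(a, W, b, R_out, eps=1e-6):
    """Redistribute output relevance R_out to the inputs of one dense layer
    (z = a @ W + b) using the LRP epsilon rule:
        R_i = sum_j  a_i * W_ij / (z_j + eps * sign(z_j)) * R_j
    """
    z = a @ W + b
    denom = z + eps * np.sign(z)   # eps pushes the denominator away from zero
    s = R_out / denom              # per-output scaling factor
    return a * (W @ s)             # relevance assigned to each input

# Toy layer: 3 input features, 2 outputs; all relevance starts at the predicted class.
rng = np.random.default_rng(0)
a = np.array([0.5, 1.0, -0.2])
W = rng.normal(size=(3, 2))
b = np.zeros(2)
R_out = np.array([1.0, 0.0])
print(lrp_epsilon(a, W, b, R_out))
```

Stacking this rule layer by layer from the output back to the inputs yields per-feature (and, in a graph network, per-node and per-edge) relevance scores of the kind the abstract describes.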

Deep Learning for Particle Jet Multiclassification

  • Group members: Nathan Roberts, Darren Chang, Sharmi Mathur

Materials:

Abstract: As data scientists, we are often drawn toward domains that generate vast amounts of data, and high-energy physics is no exception. The Large Hadron Collider (LHC) alone produces around 90 petabytes of data per year (roughly 240 terabytes per day), and thousands upon thousands of researchers comb through the LHC's particle interactions to draw conclusions. But there is one major difficulty in doing so: the colliders themselves only have instruments that can detect physical quantities (energies, momenta, and the like). The LHC simulates particle collisions that result in sprays of subatomic particles called jets. Given the many categories of jets (Higgs boson, singly charmed quarks, etc.), classification of jets must be conducted outside of the LHC by researchers and their algorithms. We implement multiple multiclass classifiers (CNN, GNN, ENN) to discriminate between six types of jets. While a similar classifier exists at the LHC, the ceiling for improvement rises with each advance in machine learning, most recently deep neural network architectures. In implementing our own neural networks, we strive to achieve a higher level of model performance.
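
As a minimal, hedged illustration of a six-class jet classifier, the sketch below trains a small PyTorch MLP on random toy features. The feature count and class labels are placeholders; the project's actual CNN/GNN/ENN models operate on much richer jet representations.

```python
import torch
import torch.nn as nn

N_FEATURES, N_CLASSES = 16, 6   # placeholder per-jet features; six jet categories

model = nn.Sequential(
    nn.Linear(N_FEATURES, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, N_CLASSES),          # logits; softmax is applied inside the loss
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a random toy batch.
x = torch.randn(32, N_FEATURES)
y = torch.randint(0, N_CLASSES, (32,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```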

COVID-19 & Microbiome


RTL Automation

  • Group members: Richard Duong, Nick Lin, Yijian Zong

Materials:

Abstract: The RTL Automation project aims to build data pipelines, automate the major workflows of the RTL, and free researchers from manual chores like updating Google Sheets or dropping in CSV files. Simply put, we make researchers' lives easier and help them end this pandemic more efficiently.

Cyber-Physical Systems (CPS) using IOT Devices


AUTOBRICK: A system for end-to-end automation of building point labels to Brick turtle files

  • Group members: Advitya Gemawat, Devanshu Desai

Materials:

Abstract: BRICK is a schema for representing various building equipment, including but not limited to, HVAC air handling units and carbon dioxide sensors in different rooms. While the schema is a clear step up over the current state-of-the-art, its potential is severely hindered because it is not backwards compatible. This means that converting CSV files storing building data to a BRICK-compatible data format is a cumbersome and imperfect process as different systems use different conventions to denote the same systems. This conversion usually required human involvement until now. AUTOBRICK is a software tool that automates this conversion with minimal human intervention and provides an order of magnitude greater speed up (90x) over the current state of the art.
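
A hedged sketch of the kind of conversion AUTOBRICK automates: mapping vendor point labels from a CSV file to Brick classes and serializing the result as a Turtle file with rdflib. The CSV layout, label-to-class mapping, and building namespace are illustrative assumptions, not AUTOBRICK's actual rules.

```python
import csv
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

BRICK = Namespace("https://brickschema.org/schema/Brick#")
BLDG = Namespace("http://example.org/building#")     # placeholder building namespace

# Illustrative mapping from vendor point-label substrings to Brick classes.
LABEL_TO_CLASS = {
    "ZNT": BRICK.Zone_Air_Temperature_Sensor,
    "CO2": BRICK.CO2_Sensor,
}

g = Graph()
g.bind("brick", BRICK)
g.bind("bldg", BLDG)

with open("points.csv") as f:                         # placeholder CSV: one point label per row
    for (label,) in csv.reader(f):
        for key, brick_class in LABEL_TO_CLASS.items():
            if key in label:
                g.add((BLDG[label.replace(".", "_")], RDF.type, brick_class))
                break

g.serialize(destination="building.ttl", format="turtle")
```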

Airborne Infection Risk Estimator for COVID-19

  • Group members: Etienne Doidic, Zhexu Li, Nicholas Kho

Materials:

Abstract: The global COVID-19 pandemic has demonstrated the exceptional transmissibility of the SARS-CoV-2 virus and has highlighted the vulnerability of our built environments to similar airborne pathogens. The traditional process for retrieving information about a target area and making estimations is complicated and involves a lot of manual work. To provide a convenient and comprehensive view of the important information and estimates for zones of interest, our team has developed an open-source app that is easy enough for the everyday consumer to use, yet can be as detailed as a building manager needs it to be.

System Usage Reporting (SUR, a.k.a. DCA)


Mouse Wait Classification

  • Group members: Pan Yeung, Sijie Mei, Yingyin Xiao

Materials:

Abstract: This thesis describes a study of machine learning applied to mouse wait time on computers. Specifically, we build a classification model that uses dynamic and static system information collected during 2020 to classify whether a mouse wait event will last 0-5 seconds, 5-10 seconds, or 10+ seconds. Dynamic system information, such as CPU utilization, depends on the configuration of each system. Therefore, by incorporating static system information, which includes each system's hardware configuration, into the model, we can significantly improve prediction accuracy. Currently, the model reaches an accuracy of 70% with a Decision Tree Classifier.
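
A hedged sketch of the bucketed classification described above, combining dynamic and static telemetry features in a Decision Tree. The file name, feature names, and bucket labels are placeholders, not the project's actual telemetry schema.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

events = pd.read_csv("mouse_wait_events.csv")   # placeholder telemetry extract
# Dynamic features vary per event; static features describe the machine itself.
features = ["cpu_utilization", "memory_utilization", "disk_queue_len",  # dynamic
            "ram_gb", "cpu_cores", "is_ssd"]                            # static
X, y = events[features], events["wait_bucket"]   # "0-5s", "5-10s", "10s+"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```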

  • Group members: Jinzong Que, Yijun Liu, Kaixin Huang

Materials:

Abstract: Our goal for this project is to understand and discover the features that affect a battery's estimated remaining time. Through our exploratory data analysis, we identified eight features: the number of devices, number of processes, average memory, average page faults, designed capacity, CPU percentage, CPU seconds, and CPU temperature. Using these eight features, we fit several models: Linear Regression, Decision Tree Regressor, SVM, Random Forest Regressor, AdaBoost Regressor, Gradient Boosting Regressor, and Bagging Regressor. To understand which model performs best on these features, we performed hypothesis testing. In the end, our results show that the Gradient Boosting Regressor performs best: the MAEs on the train and test sets are low and very similar, indicating that it suffers less from overfitting than the other models, and our hypothesis-test p-values likewise indicate that the Gradient Boosting Regressor performs best among all of them.
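
A hedged sketch of comparing regressors by train/test MAE on the eight features listed above. The file name, column names, and target column are assumptions for illustration, and only a subset of the models named in the abstract is shown.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

data = pd.read_csv("battery_telemetry.csv")     # placeholder extract containing the eight features
features = ["n_devices", "n_processes", "avg_memory", "avg_page_faults",
            "designed_capacity", "cpu_percent", "cpu_seconds", "cpu_temperature"]
X, y = data[features], data["estimated_remaining_minutes"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {"linear": LinearRegression(),
          "random_forest": RandomForestRegressor(random_state=0),
          "gradient_boosting": GradientBoostingRegressor(random_state=0)}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    test_mae = mean_absolute_error(y_test, model.predict(X_test))
    # Similar train and test MAE suggests the model is not badly overfitting.
    print(f"{name:>18}: train MAE {train_mae:.1f}, test MAE {test_mae:.1f}")
```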

Persona Analysis

  • Group members: Ruotian Gao, Xin Yu, Weihang Gao

Materials:

Abstract: In this project, our goal is to find the relationship between a user's persona and their PC system. We compare the performance of different models and use system features to predict a user's type. To achieve this, we collect data from the user end, clean the data, explore it using hypothesis tests, fit it to several classification machine learning models, and then evaluate the models' performance and optimize their parameters.

Predicting a User’s Persona Using Computer’s Specifications, CPU Utilization, CPU Temperature & Application Usage Time

  • Group members: Jon Zhang, Keshan Chen, Vince Wong

Materials:

Abstract: During the first half of this project, we learned about Intel's telemetry framework, which allows remote data collection from devices running Windows operating systems. Two important components of the framework are the Input Library (IL) and the Analyzer Task Library (ATL): the IL exposes metrics from a device, and the ATL generates on-device statistics from the data the IL collects. In the second half of the project, we used pre-collected data provided by Intel, gathered through this telemetry framework, to create a classification model. Our goal was to predict a user's persona from their computer's specifications, CPU utilization, CPU temperature, and time spent on certain types of applications. The user personas were provided by Intel, which classified users as casual web users, gamers, communication users, and so on, based on the amount of time they spent in different types of .exe files. For example, if a majority of a device's time is spent on an application like Skype, the user is most likely classified as a communication user; similarly, if a user spends most of their time in the League of Legends .exe file, they are most likely classified as a gamer. After training multiple classification models, we were able to predict user personas with 64% accuracy using a gradient boosting classifier. In the following paper, we discuss our hypotheses, processes, methodologies, and results.

Spatial Agent-based Modeling for School Reopening


Geographically Assisted Agent-based Model for COVID-19 Transmission (GeoACT)

  • Group members: Johnny Lei, Akshay Bhide, Evan Price, Kaushik Ganapathy

Materials:

Abstract: As schools attempt to reopen amid the COVID-19 pandemic, there is an increasing need to detect and quantify potentially risky activities in schools to help schools test out their individual reopening plans in order to prevent an outbreak. In this paper, we describe the development of a spatially explicit agent-based model to help detect risky activities and assess reopening plans for individual schools by incorporating elements such as behavioral factors, environmental factors, and effects from pharmaceutical and non-pharmaceutical interventions. Following this we describe the development of a gateway infrastructure powered by Apache Airavata to allow general-purpose users to run model simulations with user-defined parameters. Finally, we use the aforementioned model to estimate COVID-19 case counts and the effectiveness of proposed interventions over a two week period for a real school to demonstrate model usability.
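
A hedged toy sketch of the basic mechanic that spatial agent-based models of this kind build on: agents move within a shared space each step, and contact with an infectious agent converts susceptibles with some probability. All parameters below are illustrative only, not GeoACT's calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, N_STEPS = 30, 180          # e.g. one classroom simulated over 180 short time steps
P_TRANSMIT = 0.002                   # illustrative per-contact transmission probability
CONTACT_RADIUS = 1.5                 # metres

# State: positions in a 10m x 10m room, with one initially infectious agent.
pos = rng.uniform(0, 10, size=(N_AGENTS, 2))
infected = np.zeros(N_AGENTS, dtype=bool)
infected[0] = True

for _ in range(N_STEPS):
    pos = np.clip(pos + rng.normal(0, 0.3, size=pos.shape), 0, 10)   # random movement
    dists = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    near_infected = (dists < CONTACT_RADIUS) & infected[None, :]     # contacts with infectious agents
    exposure = near_infected.sum(axis=1)
    new_infections = rng.random(N_AGENTS) < 1 - (1 - P_TRANSMIT) ** exposure
    infected |= new_infections

print("infected after simulation:", int(infected.sum()))
```

Models like GeoACT layer real floor plans, schedules, behavioral and environmental factors, and intervention effects on top of this kind of contact-and-transmission loop.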

Modelling COVID-19 Transmission in San Diego School Buses

  • Group members: Ziqian Cui, Bernard Wong, Farhood Ensan, Areeb Syed

Materials:

Abstract: Using agent-based modelling, we model the spread of COVID-19 in San Diego school buses by running simulations of school trips, with the goal of providing guidelines on the key factors that affect the spread of the virus if schools reopen for in-person education.

COVID-19 Spatial Agent-based Modeling: Single Room Infection

  • Group members: Eric Yu, Bailey Man, Songling Lu, Michael Kusnadi

Materials:

Abstract: Several models exist for the transmission of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), each based on different assumptions and parameters. The Chu and Chen models investigate coronavirus transmission and infection as functions of non-pharmaceutical interventions (physical distance, masks) and respiratory droplets, respectively. The results of the Chu model suggest guidelines for social distancing (1 meter or more) between individuals and public usage of facial and eye protection, while the Chen model shows the relationship between droplet size and transmission range. Both models attempt to examine coronavirus transmission, and their results are not so much conflicting as incomplete on their own. The significance of this problem is that, because models vary depending on their parameters and underlying assumptions, there is uncertainty about how to filter out the valid and optimal inputs. In this replication study, we develop a simple infection rate model based on the results and parameters reported by the Chu and Chen models, the MIT COVID-19 Indoor Safety Tool, and the airborne.cam tool from Cambridge. The primary output of this project is a simulation in which a user can set parameters to see the resulting risks and infections caused by in-person instruction. This report is a secondary output, alongside the website and visual presentation, and serves as a guide to the methods and the theory behind the work.