All times listed below are for Fall 2023. To see the Winter 2024 section schedule, check this spreadsheet.

This page will no longer be updated. See the course homepage and Ed for future announcements.

Domain Descriptions

DSC Capstone, 2023-24 @ UC San Diego

Overview

Welcome to the capstone program! The capstone program is a two-quarter sequence (Fall 2023 and Winter 2024) in which you will be mentored by a faculty or industry expert in their domain of expertise. By the end of Quarter 2, you will have designed and executed a project from that domain in teams of 2-4. You can see the projects from last year at dsc-capstone.org/showcase-23.

At a high level, here’s how the capstone program is organized:

You can see the syllabus for last year’s capstone offering here.

Enrollment

The Schedule of Classes for Fall 2023 will be released on May 23rd, and first pass will begin on May 26th. The available domains are not listed on the Schedule of Classes; instead, they are detailed below. Most domains are run by UCSD faculty, but some are run by industry partners (denoted with an Industry Partner badge).

Use the information here to choose the domain you’d like to enroll in. Once you’ve chosen a domain, all you need to do is enroll in the corresponding discussion section for DSC 180A once first pass comes, space permitting. (Nothing is stopping you from waiting until second pass, but it’s less likely you’ll get a domain of your choice.) Note that you cannot change domains between DSC 180A and DSC 180B.

All of the information here – domain offerings, section times, descriptions, summer tasks, etc. – is subject to change as mentors provide us with more information.

How should I choose a domain?

You should aim to choose a domain that suits your interests and preparation. By clicking the Read more button underneath a domain, you’ll get to learn more about the mentor, their mentoring style, the prerequisites that they’d like their students to have, tasks that they’d like their students to work on over the summer, and their students’ capstone projects in previous years (if any).

Good reasons to choose a domain:

Bad reasons to choose a domain:

Everything you produce for the capstone will have to be public on the internet for the rest of eternity with your and your mentor’s names attached to it – you want your capstone work to be something that you’re proud of and can talk about on job and graduate school applications. Who do you want writing you a recommendation letter?

What happens in DSC 180A?

In addition to meeting with your mentor each week, there will also be methodology instruction delivered by Suraj and the methodology course staff. However, the majority of this instruction will occur asynchronously, in the form of readings (like this one). This means that you can ignore the lecture and lab times that appear for DSC 180A on the Schedule of Classes. The Monday lecture slots (3-3:50PM and 4-4:50PM) will be used for Suraj’s office hours, but we don’t plan to use the rest of the times.

All prerequisites for DSC 180A will be strictly enforced. The prerequisites for DSC 180A can be found here. If you took DSC 140A, DSC 140B, or DSC 148 to satisfy the machine learning prerequisite, you will need to submit an Enrollment Authorization System request in order to enroll in DSC 180A in fall quarter.

Note that since DSC 180A and DSC 180B are both 4-unit courses, you should expect to spend 12 hours a week on capstone-related work each quarter. Plan your class schedule accordingly – try not to take several time-consuming classes alongside the capstone.

For questions about the content of a particular domain, contact the mentor. For questions about the capstone sequence itself, feel free to email Suraj Rampure (rampure@ucsd.edu). For questions about enrollment, please contact Student Affairs in the VAC.


Filter by subject area:

💊 Medicine and Bioinformatics
🤝 Causal Inference and Fairness
⚙️ Distributed Systems and Other Applications
🧠 Graphs and Deep Learning
🗣️ Language Models


💊 Medicine and Bioinformatics

(back to the outline)

Application of Deep Learning to Medical Imaging, Diagnosis and Management of Disease
Albert Hsiao (a3hsiao@health.ucsd.edu)
TA: Samanvitha
A01 6 seats Monday 1-2PM, In-Person 📍 ACTRI 4W-526

This group will investigate applications of deep learning to the characterization of medical images. This will likely focus principally on radiography (x-rays) for the two quarters, but the skill set developed by students will be broadly applicable to other higher-dimensional medical image data.
Read more
  • About: Albert Hsiao, a San Diego native, completed a dual major at Caltech in Biology and Engineering/Computer Science before returning to San Diego for medical school, joining the dual-degree MD-PhD Medical Scientist Training Program (MSTP). He is a graduate of the UCSD Bioengineering and Bioinformatics PhD programs, completing his thesis work with Shankar Subramaniam and Jerry Olefsky on the functional genomics of insulin resistance. He continued his post-graduate clinical training at Stanford in general surgery, diagnostic radiology, interventional radiology and cardiovascular imaging. While a clinical diagnostic radiology resident, he co-founded Arterys, Inc, a start-up company that developed a cloud-based medical imaging artificial intelligence platform and partnered with GE Healthcare to bring the 4D Flow MRI technology to market. He returned to UCSD as faculty in the Department of Radiology, where he continues to practice clinically and founded the Augmented imaging/Artificial intelligence Data Analytics (AiDA) laboratory to further build advanced imaging techniques for diagnosis and treatment of cardiovascular disease, infection and cancer.
  • Mentoring Style: Students will work with me for the capstone section, but may also interface with graduate students and/or post-doctoral fellows in my lab.
  • Suggested Prerequisites: None
  • Summer Tasks: Review this previous paper. Students will replicate some of its results in the first quarter.
  • Previous Project

Precision Genomics with Personalized Genetic Risk Prediction
Tiffany Amariuta (tamariutabartell@ucsd.edu)
TA: Samanvitha
A02 6 seats Wednesday 10-11AM, In-Person 📍 FAH 2101

A polygenic risk score (PRS) is a weighted sum of an individual’s risk alleles across their genome for a particular phenotype, i.e. a disease or other measurement. The weights are typically the effect sizes of the risk alleles, estimated by a genome-wide association study (in the case of complex traits / polygenic diseases) or an eQTL study (in the case of gene expression). PRS have great potential to revolutionize preventive care. In theory, an individual may arrive at the clinic not knowing their genetic susceptibility to a disease, have their DNA sequenced, and learn their lifetime risk for the disease. There is a theoretical liability threshold of PRS at which individuals who have a PRS value lower than the threshold will not develop the disease and those with a value higher than the threshold will develop the disease. For diseases with a monogenic basis, it has been shown that the same degree of disease risk can be conferred by polygenic risk alone (Khera 2018 Nature Genetics). PRS are generally useful for understanding how predictive genetics is of disease and how dispersed the genetic contributions are. PRS are especially useful in understanding genetic liability when individual effects are too small to be easily detected by genome-wide association studies (Purcell 2009 Nature). In this capstone, students will use population genetics and genomics data to assess individual risk for disease outcomes and transcriptomic measurements. Students will learn to work with genotype data from 1000Genomes and genetic association data from genome-wide association studies (GWAS) and transcriptome-wide association studies (TWAS).
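As a rough illustration of the weighted-sum definition above, here is a minimal Python sketch. The genotype counts, effect sizes, and threshold value are made-up toy numbers, and the function name is hypothetical; a real analysis would take genotypes from VCF or plink files and effect sizes from a published GWAS.

```python
import numpy as np


def polygenic_risk_score(genotypes, effect_sizes):
    """Return one score per individual: the sum of risk-allele counts
    weighted by each variant's effect size."""
    genotypes = np.asarray(genotypes, dtype=float)        # (individuals, variants)
    effect_sizes = np.asarray(effect_sizes, dtype=float)  # (variants,)
    return genotypes @ effect_sizes


# Toy data (assumption): 3 individuals x 4 variants, risk-allele counts in {0, 1, 2}.
genotypes = np.array([
    [0, 1, 2, 0],
    [2, 2, 1, 1],
    [0, 0, 0, 1],
])
# Toy effect sizes (assumption), standing in for GWAS estimates.
effect_sizes = np.array([0.10, 0.25, 0.05, 0.40])

scores = polygenic_risk_score(genotypes, effect_sizes)

# Hypothetical liability threshold: scores above it are predicted cases.
threshold = 0.6
predicted_case = scores > threshold
```

In this toy setup only the second individual exceeds the threshold. In practice the threshold is not known in advance, and scores are usually compared against a reference population rather than a fixed cutoff.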
Read more
  • About: I'm a junior faculty member at UCSD jointly appointed at the Halicioglu Data Science Institute and the Department of Medicine (Division of Biomedical Informatics). I am a statistical geneticist and am interested in studying the genetic regulation of gene expression for the purpose of understanding human disease and complex traits. I received my undergrad degree in Biological Engineering at MIT, my PhD in Bioinformatics at Harvard Medical School, and completed my postdoctoral research in statistical genetics at the Harvard T.H. Chan School of Public Health. Now that I live in San Diego, I enjoy playing beach volleyball and hiking in the canyons with my dog!
  • Mentoring Style: For the first quarter, we will meet as one group as we are gaining familiarity with the core concepts of precision medicine and personalized genetic risk scores. For the second quarter, I will meet separately with each project group. Students will work with publicly available datasets and will be doing all the analysis on their own. Typical analysis will begin with acquiring VCF or plink files for genetic data. Students will be expected to work on a remote server and use plink (a convenient tool with great documentation) to manipulate files, perform pruning and thresholding, and make disease predictions across individuals. Students will be expected to use R or Python to visualize their results and assess their statistical significance.
  • Suggested Prerequisites: None
  • Summer Tasks: Read https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3912837, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6128408/, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6563838/, and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7737656/
  • Previous Project

Time Series Data Science in Health
Benjamin Smarr (bsmarr@eng.ucsd.edu)
TA: Yanyi
A03 8 seats Wednesday 2-3PM, In-Person 📍 FAH 2101

Time series come from many places - different people, different sensors, different record systems, etc. How do we bring these together to tell a useful story about a health issue? Challenges exist in the analytics - how to best use these data to gain information; in UX/UI - how do you present the information to support human insights; in alignment - how do you integrate data into predictors when time resolution is different across sources; and in parsing biological complexity - you and I are different, so what does the "right" signal look like? Past groups have worked on developing classifiers from time series data, developing data integration systems for patients, and developing data-driven interfaces to support clinical decision making.
Read more
  • About: I studied neurological control of hormones in rodents until I got frustrated with how little we know about the humans we're supposedly modeling with these lab approaches. Then I got into wearable and ambient sensor data to study humans. That led me to develop novel approaches for getting information out of multimodal timeseries data, and to apply my biological expertise to help health algorithms use more human-appropriate assumptions than most classic linear-based classifiers. Still doing that.
  • Mentoring Style: I plan to integrate the DSC team with the Bioengineering capstone team of similar domain interests. Each team will do its own work, but also be expected to complement and collaborate with the others in the domain.
  • Suggested Prerequisites: Some exposure to health systems and biology will be helpful, but you will succeed (and maybe learn more!) without this.
  • Summer Tasks: Please read and consider looking up methods from the following papers over the summer. One a month would be a very slow pace and yet enough to bring you in well warmed. I presume you're coming in with some familiarity with classifiers and networks, so this is more about the systems and needs beneath these layers. Note that the very large COVID-19 related data set in my lab is a controlled government resource (which I hate but which is the law) so we will not work in that. We will use public datasets, published data sets alongside papers, and donated patient data from collaborating clinics.
  • Previous Project

Characterizing Extrachromosomal DNA in Cancer
Vineet Bafna (vbafna@ucsd.edu)
TA: Gabriel
A04 4 seats Tuesday 3-4PM, Zoom

Extrachromosomal DNA (ecDNA) forms when DNA breaks off from the chromosome and circularizes. EcDNA evolve independently and can rapidly increase the number of cancer-promoting genes, leading to highly pathological cancers. In this project, we will look at images of tissues with the goal of identifying and separating individual nuclei as a precursor to identification of ecDNA. Students should be self-starting, have an interest in biology (some knowledge is helpful but not required), and be prepared to curate messy datasets through interactive tools.
Read more
  • About: I have been a faculty member at UCSD since 2003, working in bioinformatics. My current research interests include two topics: cancer genomics and genomics for biodiversity. In a past life, I worked on the Human Genome Project at a startup.
  • Mentoring Style: We will have a small research group of capstone students and one Ph.D. student who is also just starting out.
  • Suggested Prerequisites: None
  • Summer Tasks: Summer is a great time to start. Interested students should contact me towards the end of May. We will provide some reading and a plan for summer to interested students.

Sepsis: Using Clinical Healthcare Data Science to Identify and Combat an Infectious Killer
Kyle Shannon (kshannon@ucsd.edu)
TA: Yuyao
A05 8 seats Thursday 1:30-2:30PM, In-Person 📍 HDSI 355

Students will explore the world of inpatient ICU care by examining severe infection management and detection using the MIMIC dataset, a comprehensive, publicly available database of de-identified ICU patient data. This project will familiarize participants with healthcare data nuances and the critical role EHRs play in clinical decision-making. Through this experience, students will gain insights into the broader context of clinical decision-making and public health, learning to leverage EHRs and clinical data science for developing potential products, reports, and/or health policies. They will better understand the US healthcare system, ICU operations, and the decision-making process for complex infectious cases like sepsis. By studying the work of multidisciplinary teams, students will gain a deeper understanding of intricate ICU cases and the patients' journeys through this challenging healthcare landscape. Additionally, they will appreciate the complexities of conducting data science in a demanding environment.
Read more
  • About: Hi 👋 I’m Kyle Shannon. As a professional in the public health and data science fields, I am dedicated to improving healthcare accessibility and enhancing patient outcomes, particularly in rural America. My journey began at UCSD, where I studied in the CogSci department as an undergraduate and discovered my passion for Data Science when it was still an emerging field (2013). I later pursued my master's degree in Data Science at UCSD, and eventually co-founded a startup focused on healthcare access. My enthusiasm lies in data science projects that directly impact patient health outcomes, and I maintain a keen interest in cognitive neuroscience and tiny ML systems. Outside of work, you can find me on a tennis court or delighting in the ambiance of a cozy cafe while tackling projects.
  • Mentoring Style: My goal is to create a capstone experience that emulates a practical job setting, guiding students in effectively interacting with managers and data science leads, asking relevant questions, and fulfilling their responsibilities. I may assume various roles (e.g., DS lead, stakeholder, hospital admin, manager) to offer diverse perspectives. I incorporate a business angle to discuss the project's broader context, encouraging students to envision their work in scenarios such as product development or hospital consultancy. This approach helps them grasp real-world applications and develop a compelling narrative for their projects. I prioritize accessibility for my students throughout the week, for example, via Discord, and may involve domain experts for them to interview and learn from professionals in ICUs and EHR data. This context adds valuable insight and humanizes the data/system. I often hold informal meetings with my students over coffee to discuss progress and answer questions. Occasionally, I expect them to provide progress reports and mini-presentations, simulating a real-world organizational experience.
  • Suggested Prerequisites: These are not suggested prereqs, but students who have taken these classes will have an edge on the domain side. I just wanted to point them out to students.
    • BILD 26. Human Physiology (4)
    • USP 143. The US Health-Care System (4)
    • FMPH 101. Epidemiology (4)
    • FMPH 102. Biostatistics in Public Health (4)
    • BICD 140. Immunology (4)
    • BIEB 152. Evolution of Infectious Diseases (4)
  • Summer Tasks: The following are recommended summer domain readings and tasks. Getting through some or all of these, especially if you are a bit unfamiliar with the domain, would be a good idea and will help you hit the ground running in the fall. I will be available during the summer to meet with you as a group once or twice if you wish. On my capstone website, I will make available all material for the Fall, and put a notation by the items I think would be good candidates to begin with over the summer. For clarity, during the summer, the three areas I recommend focusing on would be:
    • Familiarizing yourself with EHR data
    • Learning about the MIMIC dataset
    • Beginning to understand a bit more about clinical critical care in an ICU

Upscaling from Old, Boring Microbiome Data to New Data That Will Change the World
Rob Knight (robknight@ucsd.edu)
TA: Samanvitha
A06 8 seats Wednesday 3-4PM, In-Person 📍 Biomedical Research Facility II 2103

Traditionally, we have read out microbial communities using just one gene, the 16S rRNA, which acts as a taxonomic marker. Newer technologies allow us to read out all the genes in a sample and increasingly to assemble complete genomes. However, we have legacy datasets of millions of samples that, if upscaled, could transform our understanding of systems relevant to climate, biofuels, biodiversity/extinction, and the human body (with indications ranging from cancer to neuropsychiatric disease). We are looking for students interested in applying transformer or other advanced models to push the limits of what we can obtain from the old datasets, ground-truthed with some of the largest new microbiome datasets currently being generated in the world.
Read more
  • About: My lab develops microbiome technologies, both computational and in the lab. I cofounded the Earth Microbiome Project and the American Gut Project/Microsetta, and was heavily involved in the Human Microbiome Project. We have related both human and environmental microbiome states to many different variables of clinical, scientific and/or societal importance. I also run or am in the leadership of several of UCSD's COVID-19 safety programs.
  • Mentoring Style: Will involve PhD students and/or postdocs depending on what direction the group takes, although I plan to be there every session except when unavoidable (e.g. travel). I usually allow several weeks to explore the general area, then either encourage the team to focus on a project of their choosing or provide more directed guidance. The 2021 capstone project led to the campus wastewater dashboard we all use here: https://returntolearn.ucsd.edu/dashboard/index.html
  • Suggested Prerequisites: None
  • Summer Tasks: We primarily use Python. Please familiarize yourself with PyTorch, Qiita, and QIIME2. My Coursera course "Gut Check" and my TED talk are rather dated but give a sense of what we're trying to do overall.
  • Previous Project

Hierarchical Latent Variable Models for Neural Data Analysis
Mikio Aoi (maoi@ucsd.edu)
TA: Yuyao
A07 4 seats Thursday 11AM-12PM, In-Person 📍 HDSI 138

Recent years have seen an explosion in the ability to routinely record from hundreds of neurons simultaneously. Data analysis methods, however, have not kept pace and there are many scenarios in which structured latent variable models could provide effective and interpretable data summaries. In these capstone projects we will review the relevant neuroscience problem setting and neurophysiology, review the history of dimensionality reduction in systems neuroscience, and learn the mathematics of latent variable models. We will then develop some novel models that could be utilized in current neuroscience experiments.
Read more
  • About: I'm a computational neuroscientist with a research focus on data science methods for neuroscience experiments. My academic career was anything but linear. I started my undergraduate career with no interest in math and knowing nothing about neuroscience and now I teach and do research at the intersection of both. I'm interested in closely examining the way that we ask questions using data and dreaming up new ways of extracting meaning about what we are and how brains function from the raw numbers.
  • Mentoring Style: I'll provide a great deal of context, coaching, and direction for the project but my capstone students will need to spend time figuring things out for themselves. We'll start out slow with exercises and "homeworks" that will be challenging but will prepare students to think critically about their work and take on challenges as they come.
  • Suggested Prerequisites: Probabilistic modeling, optimization, and linear algebra.
  • Summer Tasks: Please read Chris Bishop's Pattern Recognition and Machine Learning. Specifically:
    • Chapter 2, especially section 3 and the summary table on page 93
    • Chapter 3
    • Chapter 12
    • Chapter 10
    • Chapter 6
  • Previous Project

Modeling Mental Health with Mobile and Wearable Data
Tauhidur Rahman (trahman@ucsd.edu)
TA: Gabriel
A08 6 seats Tuesday 10-11AM, In-Person 📍 FAH 2105

The group will develop machine learning models for Adolescent Brain and Cognitive Development with multimodal data including brain imaging, physiological sensor data, and self-reports. Prof. Rahman will make the data available to the students for this secondary analysis.
Read more
  • About: Tauhidur Rahman is an Assistant Professor in the Halıcıoğlu Data Science Institute at the University of California San Diego where he directs the Mobile Sensing and Ubiquitous Computing Laboratory (MOSAIC Lab). His current research focuses on building novel ubiquitous and mobile health sensing technologies that capture observable low-level physical signals in the form of acoustic and electromagnetic waves from our bodies and surrounding environments and map them to relevant biological and behavioral measurements. Some of his notable accomplishments include a Google Research Scholar Award in 2023, a Google Ph.D. fellowship in 2016 in mobile computing, a finalist position in the Qualcomm Innovation Fellowship in 2015, Outstanding Teaching Award 2015 from Cornell University, one best paper award in ACM Digital Health 2016, one best paper honorable mention award in ACM Ubicomp 2015 and a distinguished paper award from ACM IMWUT in 2021. His work has been featured in several US-based and international media outlets including Wall Street Journal, MIT Technology Review, NewScientist, Public Television for Western New England, Daily Mail (UK) and Hindustan Times (India). His laboratory has been funded by NSF, NIH, DARPA and industry grants.
  • Mentoring Style: I will be the primary mentor. My PhD student will also co-mentor the team. While I will be the main go-to person for the capstone, the PhD student can provide additional help with coding and data analysis.
  • Suggested Prerequisites: None
  • Summer Tasks: The students are encouraged to get some familiarity with time series modeling and machine learning fundamentals.
  • Previous Project

Estimation of Connectivity in the Human Brain Using Functional MRI Data
Armin Schwartzman (armins@ucsd.edu)
TA: Samanvitha
A09 6 seats Wednesday 3:30-4:30PM, In-Person 📍 HDSI 138

An important aspect of mapping and understanding the brain is to determine which parts are connected to which. Connections in the brain may be inferred by estimating the correlations between the time series of brain activity measured by functional MRI. In this project, we will learn how to estimate the brain connectivity structure by performing statistical analysis of the Human Connectome Project data.
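To make the idea of correlation-based connectivity concrete, here is a minimal Python sketch that builds a region-to-region correlation matrix from time series. It uses plain Pearson correlation on synthetic signals, which is an assumption made for illustration; it is not the xDF method and does not use Human Connectome Project data. The function name and the four toy "regions" are hypothetical.

```python
import numpy as np


def connectivity_matrix(time_series):
    """Pearson correlation between every pair of regions.

    time_series has shape (n_timepoints, n_regions); the result has
    shape (n_regions, n_regions).
    """
    ts = np.asarray(time_series, dtype=float)
    return np.corrcoef(ts, rowvar=False)


# Synthetic signals (assumption): regions A and B share a common signal,
# region C carries the inverted signal, and region D is independent noise.
rng = np.random.default_rng(0)
n_timepoints = 200
shared = rng.standard_normal(n_timepoints)
region_a = shared + 0.1 * rng.standard_normal(n_timepoints)
region_b = shared + 0.1 * rng.standard_normal(n_timepoints)
region_c = -shared + 0.1 * rng.standard_normal(n_timepoints)
region_d = rng.standard_normal(n_timepoints)

time_series = np.column_stack([region_a, region_b, region_c, region_d])
corr = connectivity_matrix(time_series)
```

Here regions A and B come out strongly correlated and region C strongly anticorrelated with both, a toy version of the within-network and between-network structure one would look for in real data. Real fMRI time series are autocorrelated, which is one reason more careful estimators are used in practice.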
Read more
  • About: With an undergraduate degree in electrical engineering, I discovered statistics for my PhD and have been doing data science since then (even when it wasn't called by that name). Much of my work involves signal and image analysis, but I'm interested in many theoretical and applied problems, even philosophical. Outside of academia, I like doing music, dancing, swimming, and more.
  • Mentoring Style: Mentoring will involve data science PhD student Gabriel Riegner. Students are expected to take ownership over the project. This implies taking initiative in learning about the topic (from the assigned material and other sources), implementing the methods in code, being resourceful when needing help, and asking questions. Students are expected to put in their best effort, plan their time over the quarter, make substantial progress each week, report on it each week, and come up with an action plan for the next steps (as opposed to waiting for the mentor to give instructions). In other words, be independent, but ask for help when needed.
  • Suggested Prerequisites: Probability and Statistics (e.g. CSE 103, ECE 109, MATH 180A, MATH 183, MATH 189)
  • Summer Tasks: The intention of the project is to replicate the default mode network paper (link), which outlines two anticorrelated networks (task-positive and task-negative) that can be reliably detected in resting-state fMRI. Data from the HCP would be used to build region-to-region correlation matrices using the xDF method. Then the rows and columns of the correlation matrices would be grouped by network to visualize correlations within network and anticorrelations between networks. Additional reading material:
    • Functional connectivity in general: Chapter 8 from the Handbook of Functional MRI Data Analysis (link)
    • Modules 16-20 from this Principles of fMRI course (link)
    • Specific nilearn tutorials: This example goes through the steps of calculating a pairwise correlation matrix and plotting it in 3D brain space (link)
    • Specific statistical methods: The xDF paper (link), an example xDF notebook, which includes instructions on how to download the code (link)
    • The Data: An overview of the Human Connectome Project (link)
  • Previous Project

🤝 Causal Inference and Fairness

(back to the outline)

Ethical Considerations in Using Artificial Intelligence
Emily Ramond (with questions for industry mentors, email Suraj)
TA: Praveen
A10 10 seats Friday 11AM-12PM, Zoom Industry Partner: Deloitte

There is a growing societal concern over the potential and real negative effects of AI, particularly in terms of fairness and explainability. This concern is considered in this course where students will study high-profile cases of algorithmic discrimination, explore different definitions and metrics of AI fairness, and understand their practical implications. The challenge lies in translating these complex concepts into real-world applications, training students to independently analyze AI fairness and explainability, and emphasizing the societal impact of these issues. The course aims to equip students with skills to assess algorithmic fairness, understand data limitations, and apply bias mitigation techniques in AI models. Students will explore the ethical dimensions of artificial intelligence (AI), with a specific focus on fairness assessments and bias mitigation. This course integrates practical workshops and case studies, including the IBM AI Fairness 360 Model Overview and the evaluation of model bias using Medical Expenditure data. Through lectures, workshops, readings, and hands-on projects, students will gain an understanding of how to assess algorithmic fairness, measure fairness metrics, and identify the limitations of data in capturing fairness. They will also learn techniques for mitigating bias in AI models through pre-, in-, and post-processing. The course will emphasize real-world applications and the impact of ethical AI considerations on different stakeholders. Students will engage in replication projects and independent analyses to develop their skills in fairness assessments and bias mitigation.
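As a small illustration of what measuring a fairness metric can look like, here is a Python sketch of two common group-fairness measures: statistical parity difference and disparate impact. It is written directly in NumPy rather than with the AI Fairness 360 toolkit, and the predictions, group labels, and function names are toy assumptions for illustration only.

```python
import numpy as np


def selection_rate(y_pred, mask):
    """Fraction of favorable (1) predictions among the selected rows."""
    return float(np.mean(y_pred[mask]))


def statistical_parity_difference(y_pred, group, privileged):
    """Unprivileged selection rate minus privileged selection rate.
    A value of 0 means both groups receive favorable outcomes equally often."""
    y_pred = np.asarray(y_pred)
    is_privileged = np.asarray(group) == privileged
    return selection_rate(y_pred, ~is_privileged) - selection_rate(y_pred, is_privileged)


def disparate_impact(y_pred, group, privileged):
    """Unprivileged selection rate divided by privileged selection rate.
    A value of 1 means both groups receive favorable outcomes equally often."""
    y_pred = np.asarray(y_pred)
    is_privileged = np.asarray(group) == privileged
    return selection_rate(y_pred, ~is_privileged) / selection_rate(y_pred, is_privileged)


# Toy predictions (assumption): 1 is the favorable outcome.
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

spd = statistical_parity_difference(y_pred, group, privileged="a")
di = disparate_impact(y_pred, group, privileged="a")
```

In this toy example group "a" is selected 75% of the time and group "b" 25% of the time, so the difference is negative and the ratio is well below 1. Different fairness definitions can disagree with each other on the same model, which is part of what makes the choice of metric a substantive question.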
Read more
  • About: Emily completed her undergraduate studies at HDSI in 2022, where she was an active member of Marshall College. Her capstone project centered around causal inference. Post-graduation, Emily joined Deloitte as a Business Technology Analyst. In this role, she engaged in diverse tasks encompassing data analytics, machine learning, and engineering for a wide array of clients. Beyond academic and professional pursuits, Emily loves crocheting, travel, snowboarding, and fostering cats.

    Drawing inspiration from her coursework at Marshall College, Emily is passionate about ethical artificial intelligence. Her commitment extends to prioritizing fairness, transparency, and accountability. She is driven by her interest in leveraging the power of data science for the betterment of the world.
  • Mentoring Style: The capstone program is based on active participation from all students. The mentors will provide overall guidance, and a high level of student independence is required. Highlights:
    • Understand the implications of the impossibility theorem for organizations employing AI
    • Develop ethical AI models considering data-specific issues and fairness metrics
    • Explore pre, in, and postprocessing techniques for mitigating fairness issues
    • Analyze the impact of non-technical considerations on the ethical impacts of AI
    • Investigate ethical considerations across different industries and AI techniques
    • Examine the perspectives of stakeholders and the implications of false classifications
    • Utilize the AI Fairness 360 Model and Medical Expenditure data for practical projects
    • Gain insights into data science project management and collaboration within AI teams
  • Suggested Prerequisites: None
  • Summer Tasks: Review the following tools and documents: Links to Work by Previous Capstone Students
  • Previous Project

Learning and Using Causality for a Further Step of Machine Intelligence
Biwei Huang (bih007@ucsd.edu) and Babak Salimi (bsalimi@ucsd.edu)
TA: Yuyao
A11 12 seats Friday 10:30-11:30AM, Zoom

Causal information is essential in many tasks in empirical sciences and engineering. For example, in medical science, to find an effective, reliable treatment for Alzheimer’s disease, it is crucial to find the underlying genetic factors that are responsible for the disease and figure out how they jointly influence the disease. In AI, to achieve general-purpose AI with the capacity of reasoning, acting, and generalizing to novel domains or tasks, one challenge is to move beyond domain-specific pattern recognition towards the discovery and use of underlying causal relationships that produce stable and interpretable patterns across general scenarios of interest. To this end, two questions naturally arise: how can one acquire causal knowledge and, furthermore, how should we use it? Accordingly, we will focus on these two questions and study tools on (1) automated causal structure discovery from non-experimental data in complex environments, (2) advancing machine learning from the causal perspective, and (3) using or adapting causal discovery approaches to solve scientific problems.
Read more
  • About:
    • Biwei Huang: I am an Assistant Professor at HDSI. Previously, I received my Ph.D. degree from CMU. My research interests include causal discovery and inference, causality-related machine learning, and computational science.
    • Babak Salimi: My research interests are centered around advancing the field of trustworthy data analysis. I draw inspiration from my deep interest in data management, firmly believing that reliable, accessible, and well-organized data is essential for building trust in data analysis. In my research, I aim to develop methods for responsible data management and analysis to ensure transparency, fairness, reliability, and robustness in algorithmic decision-making. By adopting a data management perspective, I strive to create tools and techniques that enable human decision-makers to interpret data with greater accuracy and confidence. In my research group, we are dedicated to developing tools that enable human decision-makers with heterogeneous backgrounds to interpret data and make better-informed decisions.
  • Mentoring Style: I will suggest project ideas and guide students through the steps required to complete them. I will adapt my level of involvement based on the individual needs of each student.
  • Suggested Prerequisites: None
  • Summer Tasks:
    • Read the first two chapters of the following book: Causal Inference in Statistics - A Primer. By Judea Pearl, Madelyn Glymour, Nicholas P. Jewell
    • Read the 5th chapter of the following book: Spirtes, P., Glymour, C. N. & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press.
  • Previous Project

Auditing Opaque Algorithmic Systems for Discrimination and Bias
Stuart Geiger (sgeiger@ucsd.edu)
TA: Praveen
A14 4 seats Monday 12-1PM, In-Person 📍 HDSI 138

This group is for students interested in empirically investigating the outputs of real-world algorithmic systems for bias, discrimination, and other social issues, particularly those where the code and/or training data are not publicly available. Do facial recognition classifiers work equally well on all kinds of faces? Do a job candidate's demographics impact which jobs they are recommended on a job search site? When you ask a generative image model to create images of a data scientist, what is the distribution by demographics?
We will study classic audits of non-algorithmic decision systems (e.g. equal opportunity hiring investigations in the 1970s) and contemporary audits of real-world ML/AI systems. We will learn various approaches to investigate such opaque systems, including auditing via synthetic training datasets, user reports, API scraping, fake/sockpuppet accounts, and headless browsers (where you programmatically control a web browser). We will also learn and discuss the legal and ethical issues around this kind of auditing, particularly around violating a platform's terms and conditions, which are complex. All students must take and pass the UCSD/CITI IRB Human Subject Protection Training online course (Social and Behavioral Basic) by week 3 of Fall, as well as submit their proposed Winter projects to the UCSD Institutional Review Board for legal and ethical review. For a selection of readings on this topic, see a past syllabus for a related graduate course: https://auditlab.stuartgeiger.com
Read more
  • About: I’m a social scientist with a background in the humanities, especially history and philosophy of science and technology, but I have enough expertise in computer science and data science to make trouble. I believe that data science systems should be fair, transparent, and accountable to the public, but that most are currently not. A lot of my research is in content moderation NLP systems for user-generated content, especially Wikipedia, where I formerly worked on their ML models and systems.
  • Mentoring Style: I will be the only point of contact. Students will be expected to propose their own project auditing an existing algorithmic system of their own choosing, collect data through methods like headless browsers, and analyze data. I will help with ideas and details.
  • Suggested Prerequisites: None
  • Summer Tasks: Most important: identify potential algorithmic systems to audit for discrimination
    Must take CITI IRB course by week 3 of Fall, about 2-3 hours, so get it done early if you can. Register at https://citiprogram.org (video of me registering, because the options are complex: https://www.youtube.com/watch?v=hOAgfK93QXg)
    Our main textbooks that give overviews of this work, with examples of audits and methods:
    Other readings and examples: https://auditlab.stuartgeiger.com
  • Previous Project

⚙️ Distributed Systems and Other Applications

Distributed Systems for Financial and Behavioral Data
Rod Albuyeh (ralbuyeh@ucsd.edu)
TA: Keng-Chi
A15 6 seats Friday 9-10AM, In-Person 📍 HDSI 138

In this group, students will learn advanced techniques for building distributed systems to solve large-scale problems in finance, social science, and political science. We will start by exploring techniques for tabular data pre-processing, with a specific focus on time-series and categorical transformations. Students will learn how to apply these techniques to real-world financial and behavioral data sets, with a particular emphasis on reproducibility and scalability. In addition to learning about pre-processing, we will also explore frameworks for scaling workloads beyond a single machine, including Ray, Modin, and Spark. By the end of the course, students will have gained practical experience in enterprise-like workflows and scale, and will be able to tackle complex challenges in data science and computer engineering. To ensure that students are well-prepared for future careers, we will also cover collaborative approaches to code versioning and open source development. Students will learn how to build and contribute to open source applications, which will be hosted in public repositories like PyPI. At the conclusion of the course, students will have a robust addition to their portfolio and a deeper understanding of distributed systems for financial and behavioral data. They will be better-prepared to tackle complex data science and data engineering challenges in both academia and industry.
Read more
  • About: I'm a part-time lecturer at HDSI and a machine learning architect with eight years of experience in enterprise ML. My research interests include deep learning for tabular data (with a focus on time-series financial data), reproducible research infrastructure, and applying machine learning to political and social science problems. Outside of academia, I enjoy practicing mixed martial arts, playing racquetball, and exploring San Diego with my family. I'm passionate about data science and machine learning, and I'm looking forward to working with students who like solving interesting problems with machine learning.
  • Mentoring Style: My capstone group will be treated as its own entity.
  • Suggested Prerequisites: None
  • Summer Tasks: It will be helpful to spend some time learning the Linux command line, Git, and Python packaging, and becoming familiar with an IDE such as PyCharm.

Blockchain
Rajesh Gupta (rgupta@ucsd.edu)
TA: Praveen
A16 4 seats Saturday 10-11AM, In-Person 📍 HDSI 138

The project will build upon earlier work on GymCoin and Goodwill coins to explore the world of new distributed applications that rely upon Blockchain properties.
Read more
  • About: Rajesh Gupta serves as a founding director of the Halıcıoğlu Data Science Institute and as a distinguished professor of Computer Science and Engineering at UC San Diego. His research is in embedded and cyber-physical systems with a focus on sensor data organization and its use in optimization and analytics. Prof. Gupta holds the Qualcomm Endowed Chair in Embedded Microsystems at UC San Diego and the INRIA International Chair at the French international research institute in Rennes, Bretagne Atlantique. He is a Fellow of the IEEE, the ACM and the American Association for the Advancement of Science (AAAS).
  • Mentoring Style: Mostly as a listener to the students.
  • Suggested Prerequisites: None
  • Summer Tasks:
  • Previous Project

Graph Database Solutions for Solar Cell Manufacturing
David Fenning (dfenning@ucsd.edu)
TA: Gabriel
A17 4 seats Friday 1-2PM, Zoom

We are interested in developing models to predict the efficiency and durability of next-generation solar cells from early measures of material quality, before the solar cell is even made. We will have data from several timepoints during the fabrication of the solar cell, starting with the raw material, then more data about the expected performance as we add each layer of the cell, until we have an actual solar cell whose efficiency we can measure. We want to determine which measurements, at which timepoints, are most valuable for predicting solar efficiency. Similarly, we will expose the solar cells to operation in simulated sunlight, and we would like to build models to predict the durability of performance from the material-level data, establishing predictive quality even before we've completed the solar cell.
Read more
  • About: David P. Fenning is an associate professor in NanoEngineering at UC San Diego, where he directs the Solar Energy Innovation Laboratory (SOLEIL). His group researches materials for solar energy conversion and storage. Currently, his work focuses on defects and reliability in silicon and hybrid perovskite solar cells and CO2 electrocatalysis for solar fuels. After completing his Ph.D. on silicon solar cell materials at MIT in 2013, he worked with the silicon R&D team at 1366 Technologies Inc., followed by an MIT/Battelle postdoctoral fellowship in solar fuels. He joined the NanoEngineering Department at UC San Diego in 2015. He is a recipient of the American Chemical Society's PRF New Investigator award and was recognized as a 2017 Hellman Fellow. His research is supported by the DOE SunShot Initiative and the California Energy Commission.
  • Mentoring Style:
  • Suggested Prerequisites: Prior experience with, or willingness to learn, setting up NoSQL databases is a plus. Visualizations will be done in Python and JavaScript.
  • Summer Tasks: None
  • Previous Project

Blockchain
Sheffield Nolan (with questions for industry mentors, email Suraj)
TA: Praveen
A18 8 seats Wednesday 2-3PM, Zoom Industry Partner: Franklin Templeton

The project domain for college students pursuing degrees in Data Science within the blockchain field presents an exciting and dynamic landscape for exploration and innovation. In this domain, students can engage in projects that involve leveraging blockchain technology to address various real-world challenges. They can design and develop smart contracts for applications such as supply chain management, digital identity verification, or decentralized finance (DeFi). Students can also explore the integration of blockchain with emerging technologies like Internet of Things (IoT) or Artificial Intelligence (AI), enabling secure and transparent data exchange and enhancing data privacy. Furthermore, they can analyze blockchain data to identify patterns, detect anomalies, and develop predictive models for optimization and decision-making. By engaging in such projects, students gain practical experience in blockchain development, data analysis, and problem-solving, enabling them to contribute to the advancement of this transformative technology and become key players in the blockchain ecosystem.
Read more
  • About: Sheffield Nolan is an enterprise architect for Franklin Templeton focusing on FinTech innovation. Sheffield advises and provides technical guidance for early stage fintech companies within Franklin Templeton’s fintech partnerships and corporate strategic investments.
    Sheffield specializes in many key areas of FinTech, including Artificial Intelligence (AI) and Blockchain, with an emphasis on DeFi, zero knowledge proofs, generative adversarial networks, Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformers (GPT).
    He is also a contributor to the official Coinbase Python API project on GitHub.
    Prior to joining Franklin Templeton, Sheffield was the founder and CEO/CTO of AppRedeem, an innovator in the mobile rewards space. Sheffield led AppRedeem's venture funding through 2 rounds totaling $1.7MM from Blue Run Ventures and SV Angel. AppRedeem was acquired by the publicly traded company Perk in 2015.
    Prior to AppRedeem, Sheffield developed apps that climbed to the top 5 paid and free positions in the Apple App Store (U.S. and international markets). Before that he architected and managed large scale solutions for many Fortune 500 companies and venture backed startups, including Visa, eBay and PayPal.
  • Mentoring Style:
  • Suggested Prerequisites: None
  • Summer Tasks:
  • Previous Project

🧠 Graphs and Deep Learning

Robust and Interpretable Neural Network Models for Computer Vision and Natural Language Processing
Lily Weng (lweng@ucsd.edu)
TA: Keng-Chi
B01 6 seats Monday 4-5PM, Zoom

This project is for students interested in the robustness and interpretability of deep neural network models. Students will develop methods to improve the robustness and interpretability of deep learning models for tasks such as computer vision and natural language processing.
Read more
  • About: Lily Weng is an assistant professor in the Halıcıoğlu Data Science Institute with an affiliation in the Computer Science and Engineering Department at UC San Diego. Her research vision is to make the next generation of AI systems and deep learning algorithms more robust, reliable, explainable, trustworthy, and safe.
  • Mentoring Style: This project will be purely research-oriented and heavier than the usual course project. Students are expected to lead their capstone project under Prof. Weng's guidance. Students familiar with deep learning algorithms in computer vision or natural language processing, and deep learning libraries (e.g. pytorch) and neural networks are more likely to succeed in this project.
  • Suggested Prerequisites: Some of DSC 140A, DSC 140B, DSC 190 Trustworthy Machine Learning, CSE 151A, CSE 151B, CSE 150A, CSE 150B, CSE 152A, CSE 152B, CSE 156
  • Summer Tasks: If you are able to understand the papers below and set up the code repo provided below, then you are likely to succeed in this capstone project. If you have problems setting up the repo and/or understanding the technical details in the papers, then this capstone section is very likely not a good fit for you (please don't be discouraged; it only means that you may need more background, e.g. you are encouraged to take Prof. Weng's DSC 190 Trustworthy Machine Learning, as well as DSC 140A, DSC 140B, CSE 150-152, and CSE 156, before this capstone section).
    Reading:
    Code setup:
    Required experience in Python and PyTorch
    Excellent materials for deep learning: https://web.eecs.umich.edu/~justincj/teaching/eecs498/WI2022/
  • Previous Project

Bird Song Analysis and Classification with an Emphasis on Active Learning
Yoav Freund (yfreund@ucsd.edu)
TA: Yanyi
B02 4 seats Tuesday 3:30-4:30PM, In-Person 📍 HDSI 138

Bird songs are a rich area attracting many professionals and amateurs. The Cornell Lab of Ornithology (https://www.birds.cornell.edu/home/) is a leader in electronic resources for bird song identification, including:
Beyond Cornell, there are many additional efforts in bird song identification, including the BirdCLEF Kaggle competition: https://www.kaggle.com/c/birdclef-2023
Locally, at UCSD's Scripps Institution of Oceanography, the Hildebrand lab (https://www.cetus.ucsd.edu/people/JohnHildebrand.html) has been collecting continuous bird songs for the last three years. Analyzing this continuous recording from a single site is a new challenge that can shed light on the behaviour of birds in the wild.
Read more
  • About: 30 years of research in machine learning, in particular learning theory. Interested in applications of ML to problems in biology.
  • Mentoring Style: In the meetings I like to see Jupyter Notebooks that show the current state of progress as well as problems. Each meeting ends with tasks for the following week.
  • Suggested Prerequisites: Experience with DNNs and XGBoost. Experience with signal processing and filtering.
  • Summer Tasks: Obtain the code for the top contenders in the Kaggle competition, as well as the latest version of BirdNET. Reproduce some of the top results.
  • Previous Project

Deep Learning for Climate Model Emulation
Duncan Watson-Parris (dwatsonparris@ucsd.edu)
TA: Yanyi
B03 6 seats Wednesday 2-3PM, In-Person 📍 HDSI 437

The choices humanity makes in the next few decades will determine how much warmer the Earth will be by the end of the century, with implications for billions of lives and trillions of dollars in GDP. Many different emission pathways exist that are compatible with the Paris climate agreement, and many more are possible that miss that target. While some of the most complex climate models have simulated a small selection of these, it is impractical to use these computationally expensive models to fully explore the space of possibilities or assess all the associated risks. Our lab has recently developed state-of-the-art climate model emulators to enable fast, accurate and reliable predictions for any given scenario (https://github.com/duncanwp/ClimateBench). This project will extend this work by incorporating multiple climate models at different levels of fidelity to provide high-resolution predictions with robust uncertainties for improved decision making.
Read more
  • About: Duncan Watson-Parris is an atmospheric physicist working at the interface of climate research and machine learning to investigate the effect of air pollution on the climate. Using cutting-edge machine learning techniques to combine global models with satellite data, his group looks to better understand complex aerosol-climate interactions and improve projections of climate change. He recently moved to San Diego from Oxford, England and enjoys soccer, chess and role-play games, but is currently learning to surf!
  • Mentoring Style: This work is central to my research interests and will be integrated into my broader group program to the extent the students want to engage with it. The students will be welcome to join my group meetings (typically held at Scripps Institution of Oceanography).
  • Suggested Prerequisites: DSC 140A
  • Summer Tasks:

Semantic Search: Deep Learning Meets Information Retrieval
Colin Jemmott (cjemmott@ucsd.edu)
TA: Yuyao
B04 8 seats Wednesday 9-10AM, In-Person 📍 HDSI 336

This section offers a hands-on exploration of semantic search, focusing on applications beyond traditional web search, such as code search, e-commerce, and file search. In the first quarter, students will reproduce results from a recent paper on semantic search for code, gaining practical experience in implementing state-of-the-art techniques. The second quarter will involve a group project, where students collaborate to develop their own innovative search solution in a domain of their choice. This capstone emphasizes the importance of constructing functional systems and applying the latest advancements in deep learning and large language models to improve search accuracy and efficiency.
Read more
  • About: I'm a data scientist at Stitch Fix, where I build search and recommender systems. It is extremely gratifying to use data and math to help so many people make better decisions.
  • Mentoring Style: This section will be open-ended and ill-posed, meaning that you will need to first define and then solve the problems. I will do everything I can to help you, but you will play a leading role in our collaboration.
  • Suggested Prerequisites: None
  • Summer Tasks: Please get a copy of "AI-Powered Search" by Trey Grainger, Doug Turnbull, and Max Irwin. The publisher, Manning, often has sales on the eBook. I strongly recommend skimming the whole thing, but if you at least look through Section 1, it will help us all have a shared foundation.
  • Previous Project

Diffusion Process as a Generative Model
Arya Mazumdar (arya@ucsd.edu)
TA: Trevor
B05 3 seats Tuesday 11AM-12PM, In-Person 📍 HDSI 431

A diffusion process is a mathematical model used to describe how particles spread out over time. The diffusion process starts with an initial state and then iteratively updates the state of the system by applying a series of random perturbations. Each perturbation is assumed to be small and random, and the overall behavior of the system emerges as a result of the accumulation of these random perturbations. This has found an application in artificial image generation. You can progressively add noise to an image to create an extremely noisy image that is indistinguishable from pure noise. However, if we know the reverse process, we can perform denoising to get the original image back. People have learned this reverse diffusion process and applied it to pure noise to get artificial images.
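As a rough illustration of the forward (noising) half of this idea, the sketch below repeatedly applies small Gaussian perturbations to a toy "image" until almost none of the original signal is left. It is a minimal sketch under stated assumptions, not the method used in this domain: the linear noise schedule, the 1000-step count, the function names, and the flattened all-ones image are all hypothetical choices made for illustration. The learned reverse (denoising) process is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances that grow linearly (an assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_diffuse(pixels, betas, rng):
    """Apply one small random perturbation per step.

    Each step slightly shrinks the current values and adds Gaussian noise:
        x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise
    """
    x = list(pixels)
    for beta in betas:
        keep = math.sqrt(1.0 - beta)
        spread = math.sqrt(beta)
        x = [keep * value + spread * rng.gauss(0.0, 1.0) for value in x]
    return x


def remaining_signal(betas):
    """Factor by which the original pixel values are scaled after all steps."""
    return math.sqrt(math.prod(1.0 - beta for beta in betas))


rng = random.Random(0)
image = [1.0] * 1000                # hypothetical flattened image, every pixel 1.0
betas = linear_beta_schedule(1000)  # hypothetical number of diffusion steps
noisy = forward_diffuse(image, betas, rng)

signal = remaining_signal(betas)
mean = sum(noisy) / len(noisy)
variance = sum((value - mean) ** 2 for value in noisy) / len(noisy)
print(f"remaining signal factor: {signal:.4f}")
print(f"noisy image mean: {mean:.3f}, variance: {variance:.3f}")
```

With this schedule the remaining signal factor is close to zero, and the noisy pixels have roughly zero mean and unit variance, which is what "indistinguishable from pure noise" means in practice. A generative model would then be trained to undo these steps one at a time.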
Read more
  • About: Arya Mazumdar is an Associate Professor of Data Science at UC San Diego. Arya obtained his Ph.D. degree from the University of Maryland, College Park, specializing in information theory, which is the science of quantifying information and belongs to the intersection of statistics and engineering/computer science. Subsequently, Arya was a postdoctoral scholar at the Massachusetts Institute of Technology, an assistant professor at the University of Minnesota, and an assistant and then associate professor at the University of Massachusetts Amherst. These days, Arya is interested in aspects of statistical inference and optimization.
  • Mentoring Style: Students should be able to code independently
  • Suggested Prerequisites: Probability, Markov Chains, Inference, Signal Processing, Linear Algebra
  • Summer Tasks: Understand this: https://dataflowr.github.io/website/modules/18a-diffusion/
  • Previous Project

Hardware Acceleration of ML Algorithms
Rajesh Gupta (rgupta@ucsd.edu)
TA: Trevor
B06 4 seats Wednesday 10-11AM, In-Person 📍 HDSI 336

Machine learning acceleration using hardware such as FPGAs refers to the design and implementation of hardware blocks that accelerate either application code (such as manipulation of graph neural networks) or architectural mechanisms (such as prefetches, memory assists, etc.). In this project, you will explore the use of architectural mechanisms that substantially speed up the selected ML code.
Read more
  • About: Rajesh Gupta serves as a founding director of the Halıcıoğlu Data Science Institute and as a distinguished professor of Computer Science and Engineering at UC San Diego. His research is in embedded and cyber-physical systems with a focus on sensor data organization and its use in optimization and analytics. Prof. Gupta holds the Qualcomm Endowed Chair in Embedded Microsystems at UC San Diego and the INRIA International Chair at the French international research institute in Rennes, Bretagne Atlantique. He is a Fellow of the IEEE, the ACM and the American Association for the Advancement of Science (AAAS).
  • Mentoring Style: Mostly as a listener to the students.
  • Suggested Prerequisites: None
  • Summer Tasks: None

Graph ML for Chip Profiling
Lindsey Kostas (with questions for industry mentors, email Suraj)
TA: Gabriel
B07 10 seats Monday 1-2PM, In-Person 📍 HDSI 355 Industry Partner: Qualcomm

Machine learning is becoming an increasingly necessary technique in the design of chips due to the end of Moore’s Law and the increased complexity of the process, functionality requirements, and design time limits. A circuit represents a complex graph with unique properties that do not exist in more common graph ML applications such as those for social networks or biological entities. As a result, graph machine learning offers a powerful set of techniques to understand the fundamental properties of the chip design and thereby create better designs more quickly. This capstone will expose students to graph algorithms and graph ML through the exploration of unsupervised learning on chip designs and equip them with the skills to tackle arbitrary graph modeling tasks.
Read more
  • About: Lindsey is a Senior Staff Machine Learning Engineer at Qualcomm, where she has led multiple projects in ML-based CAD/EDA that have impacted the SoC design process for teams across the globe, leading to significant savings in time-to-market, compute, and NRE cost. She consults on a variety of ML-driven initiatives in applications ranging from digital and analog design to 5G to licensing to standards, and holds one granted and five pending patents related to this work. In 2021, she was honored by the Global Semiconductor Alliance (GSA) as the inaugural Female Up-And-Comer. Prior to joining Qualcomm, Lindsey was a 4-year scholarship athlete at Stanford University, where she won two tennis national team championships. After graduating with distinction in Economics, she obtained her master’s degree in Computer Science with an emphasis in Artificial Intelligence from Stanford University, where she pursued research in deep representation learning for Chris Ré and Jure Leskovec. Her current research interests are in the intersection of ML and constrained optimization, human-centered AI, graph ML, and generative models. Beyond her technical work, Lindsey is president of Qualcomm Women in Machine Learning and devotes much of her free time to helping mentor and develop young engineers. She is also passionate about cooking and sports/athletics, in particular tennis, boxing, biking, running, hiking, and dance.
  • Mentoring Style: We plan to have ~5 industry mentors who will bring a mix of ML, SW, and domain of application experience, as well as experience in different types of professional roles and years on the job. Seminar sessions will be a mix of technical tutorials and paper readings with discussion, with mentors rotating to lead each week's session. We seek to have highly interactive sessions where students come prepared to discuss that week's topic. Mentors will look to students to define project topics and research plans, but will actively advise on these decisions as well as the technical project implementation.
  • Suggested Prerequisites: Deep learning; deep learning frameworks such as PyTorch or TensorFlow. Helpful, but not required: graph algorithms; graph theory; ASIC/FPGA/VLSI design
  • Summer Tasks: Students should be familiar with a deep learning framework; PyTorch is recommended. The following resources are helpful for any students looking to get a head start on graph ML or better understand the domain of application:

Utility and Wildfire Risks Detected From Street View Imagery
Phi Nguyen (with questions for industry mentors, email Suraj)
TA: Gabriel
B08 12 seats Wednesday TBD, MS Teams Industry Partner: SDG&E

San Diego Gas & Electric leverages many different public and private data sources to make critical decisions that impact our communities. We would like to explore Google Street View as a publicly available source of data to help us identify risks that can be observed from the perspective of San Diego citizens. The project goals are to quantify the ability to observe damaged assets or fire from commonly traveled paths, determine whether there are clear compliance infractions that can be seen from the citizen's perspective, and identify other utility-related hazards that can be seen from this public data source.
Read more
  • About: Dr. Phi Nguyen is a senior data scientist at San Diego Gas & Electric, where he leads the Data Science Center of Excellence. Dr. Nguyen graduated from UCSD with a Ph.D. in materials science and engineering, where he developed nanomaterials for clean energy applications. He has worked for several years as a consultant in the energy sector, where his focus was on using data to support policies that promote clean energy and energy efficiency. Dr. Nguyen joined SDG&E to focus on developing models to mitigate wildfire risk in California and has since expanded his work to other areas that benefit San Diego communities.
  • Mentoring Style: Student group will be a stand-alone unit at SDG&E led by Mentor. Mentor will first work with students to understand utility space, and then schedule time with other SDG&E staff who will provide tours, field visits, and other utility-specific training. Students will also be introduced to other data scientists and engineers at SDG&E who are available for support on an as-needed basis throughout the duration of the project. However, once an introduction is made, it will be up to the students to reach out to staff when support is needed. Students will be encouraged to run their ideas by staff members beyond the mentor.
  • Suggested Prerequisites: None
  • Summer Tasks: Review the Google Street View API documentation: https://developers.google.com/maps/documentation/streetview. Any image detection experience is helpful.

Generative AI for Good
Ali Arsanjani (with questions for industry mentors, email Suraj)
TA: Keng-Chi
B09 8 seats Thursday 4-5PM, In-Person 📍 HDSI 138 Industry Partner: Google

"Generative AI for Good" refers to the application of generative artificial intelligence (AI) techniques to address societal challenges and promote positive outcomes. In the context of misinformation and disinformation detection and mitigation, it involves leveraging generative AI models to combat the spread of false or misleading information and reduce socio-political polarization. Generative AI models, such as language models and deep learning algorithms, have shown remarkable capabilities in generating text and content that closely resembles human-produced content. These models can be trained to understand and analyze large amounts of data, including news articles, social media posts, and online discussions, to detect patterns and identify potential misinformation or disinformation. By employing generative AI techniques, it becomes possible to develop sophisticated algorithms and systems that can automatically identify false or misleading information, distinguish it from accurate information, and mitigate its impact on public opinion and discourse. These systems can analyze the content, context, and sources of information, looking for inconsistencies, logical fallacies, and biases that are indicative of misinformation. Generative AI can also play a crucial role in reducing socio-political polarization by promoting more balanced and factual narratives. By identifying and flagging content that contributes to polarization, algorithms can provide users with alternative viewpoints, fact-checking information, or context that helps to counterbalance the biases inherent in some narratives. This can encourage critical thinking, promote a more informed public, and foster constructive dialogue across diverse perspectives. However, it is important to note that generative AI techniques are not without challenges. 
Ensuring the accuracy and fairness of these models, avoiding biases, and balancing freedom of expression with the need to combat misinformation are critical considerations. Ethical guidelines and rigorous validation processes should be put in place to address these concerns and ensure the responsible and effective deployment of generative AI for good in the context of misinformation and disinformation detection and mitigation.
alternusvera.com
Read more
  • About: Dr. Ali Arsanjani is the Director of Cloud Partner Engineering at Google Cloud, where he leads strategic co-innovation partnerships in Generative AI, Data/Analytics, and Predictive AI/ML. He is also the Head of AI Center of Excellence and works closely with product management to shape Google's AI and analytics offerings from a cloud perspective. Prior to joining Google, Ali held significant roles at AWS, IBM, and other companies, where he played key roles in the development and implementation of AI and ML solutions. He is recognized for his expertise in areas such as NLP, deep learning ensemble models, customer segmentation, and conversational virtual assistant implementations. Additionally, Ali has served as an adjunct professor and has contributed to the development of industry standards in service-oriented architecture and cloud computing.
  • Mentoring Style: Will combine the students in the capstone with a larger research group; responsibility, diligence, accountability, and a desire to explore are essential
  • Suggested Prerequisites: Experience with deep learning frameworks
  • Summer Tasks: Papers on Transformers, hands-on small bite-size projects that can help pave the way for the generative AI project domain

Leveraging SANDAG Unique Data to Enhance Public Planning and Quality of Life for the San Diego Region
Cindy Burke, Ray Major, and Liang Tian
With questions for industry mentors, email Suraj
TA: Praveen
B10 10 seats Thursday 2-3PM, Zoom Industry Partner: SANDAG

The San Diego region is entering a new era of innovation, driven by exciting advances in technology, biotechnology, renewable energy, cybersecurity, and automation that will change the way people live. As a metropolitan planning organization and a council of governments, HDSI Industry Partner SANDAG (San Diego Association of Governments) is bringing together local governments to develop solutions to regional issues, including improving equity, transportation infrastructure (such as the UC San Diego Blue Line Trolley Extension), air quality, clean energy, economic development, goods movement, public health, public safety, housing, and so much more. The SANDAG Data Science team plays an integral part in achieving these missions, standing at the forefront of analyzing and disseminating regional data and publishing research on a broad range of topics related to the region’s quality of life.
Would you like to work with a group of talented Ph.D. Data Scientists in the public domain to improve well-being and quality of life in our beautiful San Diego region? In this domain, students will explore a wide variety of SANDAG's unique, publicly available data assets (https://opendata.sandag.org/); experiment with multi-dimensional data science disciplines including data ingestion, data transformation, data enrichment, analytics, visualization, and predictive model development; and propose innovative ideas and have the opportunity to work with SANDAG staff to make our region one of the best places to live.
This domain requires hands-on proficiency with modern data science tools and the ability to work with both structured and unstructured data. Example project outcomes potentially include: enhancing data pipeline development; improving SANDAG's unique data products using advanced ML algorithms; reconnecting disadvantaged communities; equitable transportation community assessment; transit behavior analysis; etc.
Read more
  • About:
    • Cindy Burke is Senior Director of Data Science at SANDAG. She received her Ph.D. at UCSD and has held various leadership roles across the agency for over 3 decades. It is an exciting place to be because we have the opportunity to use data to really have a positive impact in so many different ways in the San Diego region.
    • Ray Major is an expert in regional economics who advises organizations throughout the county on emerging and long-term economic trends. He is the Chief Economist and Deputy CEO of Business Operations at SANDAG where he leads strategic direction and agency operations. He regularly speaks on economic forecasts, the impacts of inflation, border economics, and more. Ray is an industry fellow for the UCSD Data Science Institute, serves on the boards of three regional economic development councils, and regularly contributes to the San Diego Union-Tribune’s Econometer. With 25 years of private sector experience, Ray brings a unique business perspective to regional planning. He has worked everywhere from Fortune 500 companies to high-tech startups. Ray takes pride in offering objective, data-driven assessments with accuracy and transparency. He serves San Diego residents by communicating how current economic trends affect our daily lives.
  • Mentoring Style: We would love to invite students to come to the SANDAG campus on a regular basis (though not necessarily every week) to interact with SANDAG staff and help formulate potential project proposals. Mentoring may involve Data Science Managers and Ph.D. Data Scientists.
  • Suggested Prerequisites: None
  • Summer Tasks: Please explore the SANDAG Open Data Portal for all available data for potential use in the projects: https://opendata.sandag.org/

From Theory to Pixels: Diffusion Models for Image and Data Generation
Alex Cloninger and Rayan Saab
acloninger@ucsd.edu and rsaab@ucsd.edu
TA: Keng-Chi
B11 8 seats Thursday 12:30-1:30PM, In-Person 📍 HDSI 138

In the fast-evolving field of digital image generation, diffusion models have emerged as a powerful technique to create high-quality synthetic images, as well as other structured signals. They do so by leveraging training data to implicitly learn the underlying distribution and guide the generative process. In the case of images, for example, they effectively deconstruct and reconstruct images using a combination of carefully designed stochastic processes, probabilistic models, and deep neural networks. The domain of this project encompasses the basic theoretical foundations of diffusion models as well as their practical applications in generating visually compelling results. Students who choose this project will delve into the mathematical and computational principles behind diffusion-based methods, exploring concepts such as stochastic processes, partial differential equations, and probabilistic models. They will also engage in hands-on coding and experimentation with diffusion models, applying them to various datasets and signal models.
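As a rough illustration of the stochastic process behind these models, the sketch below implements only the forward (noising) half of a diffusion model in plain Python. The linear noise schedule, its endpoints, the 1000-step horizon, and all function names are assumptions chosen for illustration; they are not taken from the course materials. A real diffusion model would additionally train a neural network to reverse this process.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_1..beta_T, linearly spaced (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s): share of signal variance left at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_sample(x0, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I) in one closed-form step."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

x0 = [2.0] * 5000                               # toy "image": a constant signal
x_mid = noise_sample(x0, alpha_bars[249], rng)  # partially noised
x_end = noise_sample(x0, alpha_bars[-1], rng)   # almost pure noise
```

With this schedule, `alpha_bars[-1]` is close to zero, so `x_end` is nearly indistinguishable from standard Gaussian noise regardless of the starting signal. That is the property a learned reverse process relies on when it generates samples starting from noise.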
Read more
  • About:
    • Alex Cloninger is an Assistant Professor in Mathematics and the Halicioglu Data Science Institute. He works on computational models for learning similarities between data, and using these similarity measures to solve various scientific problems. Find out more about Dr. Cloninger's research: https://ccom.ucsd.edu/~acloninger/index.html
    • Rayan Saab is an Associate Professor in the Mathematics Department and at the Halicioglu Data Science Institute. He works on developing computational methods and theory for solving problems related to collecting, processing, and analyzing data. He came to this work first through an undergrad degree in electrical engineering and finding himself always interested in both making things work and understanding why they do. Find out more about Dr. Saab's research: http://www.math.ucsd.edu/~rsaab/
  • Mentoring Style: We both are relatively hands-on in the sense that we make ourselves available for problem-solving and discussions. That said, students have to be self-motivated, and motivated to do the readings and the work.
  • Suggested Prerequisites: Having a course or two on probability and/or machine learning would be very helpful -- in particular, being familiar with Bayes' rule and conditional probabilities (or being willing to learn very quickly). Familiarity with the basics of neural networks would also be very helpful.
  • Summer Tasks: Here are some relevant readings. Students need not go into all the mathematical details of the papers, as we can go through them together, but these papers give an idea of the domain. The more familiar they are with the topic, the more we can do! One objective of the replication project could be to recreate the numerics in the third paper (Sohl-Dickstein et al.).
  • Previous Project

Handling Long Range Interactions in Graph Neural Networks
Gal Mishne and Yusu Wang
gmishne@ucsd.edu and yuw122@ucsd.edu
TA: Trevor
B12 12 seats Wednesday 9-10AM, In-Person 📍 HDSI 355

Graph data are ubiquitous in a broad range of applications. Recently, we have witnessed the tremendous use and success of various versions of graph neural networks (GNNs) in analyzing and learning on graph data. However, popular GNN architectures have difficulty effectively aggregating information along long-range interactions (e.g., between nodes that are far apart in the graph). In particular, as the graph becomes larger, it is well known that some popular GNN architectures suffer from issues such as over-smoothing or over-squashing. Effectively capturing and learning from long-range interactions in graphs is one of the most important challenges in graph learning. More recently, graph transformer architectures have gained popularity for handling long-range interactions as well, but they have various limitations (including scalability). The goal of this group is to learn and run some SOTA GNN models, and to develop strategies that improve their ability to capture long-range interactions in input graphs.
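To make the long-range problem concrete, here is a toy sketch in plain Python (no GNN library and no learned weights). It assumes a path graph with self-loops and plain mean aggregation, both chosen purely for illustration. It shows two effects: a signal at one end cannot reach a node d hops away in fewer than d rounds of message passing, and after many rounds all node features collapse to nearly the same value (over-smoothing).

```python
def path_graph(n):
    """Neighbor lists for the path 0-1-...-(n-1), with a self-loop at each node."""
    return [[j for j in (i - 1, i, i + 1) if 0 <= j < n] for i in range(n)]


def mean_aggregate(features, neighbors):
    """One round of message passing: each node averages over its neighborhood."""
    return [sum(features[j] for j in nbrs) / len(nbrs) for nbrs in neighbors]


def propagate(features, neighbors, rounds):
    """Apply mean aggregation repeatedly, like stacking `rounds` identical layers."""
    for _ in range(rounds):
        features = mean_aggregate(features, neighbors)
    return features


def spread(values):
    """Gap between the largest and smallest node feature."""
    return max(values) - min(values)


n = 10
neighbors = path_graph(n)
signal = [1.0] + [0.0] * (n - 1)   # all information starts at node 0

after_3 = propagate(signal, neighbors, 3)
after_9 = propagate(signal, neighbors, 9)
after_500 = propagate(signal, neighbors, 500)
```

Here `after_3[4]` is exactly zero because node 4 is four hops from node 0, while `after_9[9]` is positive but tiny: the far end of the path only hears a heavily attenuated version of the signal. `spread(after_500)` is close to zero, meaning the node features have become almost identical.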
Read more
  • About:
    • Gal Mishne's research is on geometric data analysis. My research focuses on modeling data as lying on a graph or being sampled from a (nonlinear) manifold, and developing methods that take this geometry into account in order to process, analyze, and visualize high-dimensional data. My research group primarily collaborates with neuroscientists and other biomedical researchers, to apply our models and methods to real-world data.
    • Yusu Wang's research primarily focuses on geometric and topological data analysis. I am particularly interested in developing effective and theoretically justified algorithms for data / shape analysis using geometric and topological ideas and methods, as well as to integrate such methods with machine learning frameworks. I aim to both provide theoretical understanding of various computational methods developed, and to apply them to practical domains, including computational biology, computational neuroanatomy, material science, computer graphics and visualization.
  • Mentoring Style: We expect students to be self-motivated to do the reading and coding tasks, and to have some degree of independence in their capstone projects. Students are expected to treat the project seriously and devote time to making weekly progress toward their goals. We are always happy to discuss and help problem-solve.
  • Suggested Prerequisites: Students should be familiar with neural network architectures and have done projects on deep learning. Knowledge of graph neural networks will be a plus but not required.
  • Summer Tasks: Check out PyTorch Geometric (https://pytorch-geometric.readthedocs.io/en/latest/)
  • Previous Project

Stochastic Optimization in Over-Parameterized Models: Neural Networks and Kernel Machines
Misha Belkin and Yian Ma
mbelkin@ucsd.edu and yianma@ucsd.edu
TA: Yanyi
B13 12 seats Wednesday 9-10AM, In-Person 📍 HDSI 455

Stochastic optimization has been the main powerhouse for modern machine learning. In particular, one needs to tune the optimization methods for over-parameterized models such as deep neural networks or kernel machines. During this capstone project, we will explore how to train those over-parameterized models with stochastic gradient descent and its variants. We will also study interesting phenomena arising from such technical problems.
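As a small illustration of what training an over-parameterized model with stochastic gradient descent can look like, the following plain-Python sketch fits a linear model with more parameters (20) than training points (5) to pure-noise labels. The problem sizes, the step-size rule, and the function names are assumptions made for this example and are not part of the capstone materials. Because the system has more unknowns than equations, SGD can drive the training loss to essentially zero, i.e., the model interpolates the data.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_loss(w, xs, ys):
    """Mean squared error of the linear model w on the training set."""
    return sum((dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sgd(xs, ys, steps, rng):
    """Constant-step SGD on the squared loss, one random example per step."""
    w = [0.0] * len(xs[0])
    # Assumed step-size rule: 1 / (largest squared feature norm), which keeps
    # every single-example update stable.
    lr = 1.0 / max(dot(x, x) for x in xs)
    for _ in range(steps):
        i = rng.randrange(len(xs))
        residual = dot(w, xs[i]) - ys[i]
        # Gradient of 0.5 * (w.x - y)^2 with respect to w is residual * x.
        w = [wj - lr * residual * xj for wj, xj in zip(w, xs[i])]
    return w


rng = random.Random(0)
n_samples, n_features = 5, 20     # more parameters than data points
xs = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
ys = [rng.gauss(0, 1) for _ in range(n_samples)]   # labels are pure noise

initial_loss = train_loss([0.0] * n_features, xs, ys)
w = sgd(xs, ys, 5000, rng)
final_loss = train_loss(w, xs, ys)
```

Comparing `initial_loss` with `final_loss` shows the interpolation effect: the model fits even random labels almost exactly. How such solutions behave on unseen data is a separate question that this sketch does not address.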
Read more
  • About:
    • Yian Ma: I am an assistant professor at the Halıcıoğlu Data Science Institute and an affiliated faculty member at the Computer Science and Engineering Department of University of California San Diego. Prior to UCSD, I spent a year as a visiting faculty at Google Research. Before that, I was a post-doctoral fellow at EECS, UC Berkeley and completed my Ph.D. at University of Washington. My current research primarily revolves around scalable inference methods for credible machine learning. This involves designing Bayesian inference methods to quantify uncertainty in the predictions of complex models; understanding computational and statistical guarantees of inference algorithms; and leveraging these scalable algorithms to learn from time series data and perform sequential decision making tasks.
  • Mentoring Style:
  • Suggested Prerequisites: Probability, Optimization, Linear algebra, Multivariate calculus
  • Summer Tasks: Please finish reading the following 3 papers during the summer. Pick one to go into the nitty gritty details. Close to the beginning of the quarter, prepare a short presentation about the paper. You can either use slides or prepare to present it on the whiteboard with the fellow students who pick the same paper. Papers:
  • Previous Project

🗣️ Language Models

(back to the outline)

Investigating the Impact of Input Methods on Human Language Evolution
Haojian Jin
h7jin@ucsd.edu
TA: Keng-Chi
B14 6 seats Friday 4-5PM, In-Person 📍 HDSI 138

This project aims to investigate the relationship between input methods and the evolution of human language. Specifically, we will explore how the auto-suggestions and corrections provided by input methods can impact the vocabulary of each user. The project will focus on multiple languages, which have undergone significant changes in input methods over the past few decades.
More specifically, we will analyze the effect of input methods on the vocabulary usage of different language users. We also want to investigate whether using input methods leads to the gradual disappearance of certain words and phrases. Finally, we aim to examine input methods' impact on human language's evolution and the potential limitations it poses to human thought.
Read more
  • About: I'm an Assistant Professor in Halıcıoğlu Data Science Institute (HDSI) and Department of Computer Science and Engineering (affiliate) at UC San Diego. Our lab, Data Smith Lab, studies the security and privacy of data systems by researching the people who design, implement, and use these systems. I am interested in topics ranging from human-computer interaction, mobile computing, and security & privacy to programming languages, databases, and software engineering.
  • Mentoring Style: I will have direct interactions with students, who are encouraged to contact me with any questions they encounter.
  • Suggested Prerequisites: Experience with writing crawlers would be a plus for students who want to participate in data collection for this project.
  • Summer Tasks: During the summer, students are expected to identify potential datasets and input methods and learn about language evolution (e.g., https://www.pnas.org/doi/10.1073/pnas.96.14.8028).
  • Previous Project

Weakly Supervised NLP
Jingbo Shang
jshang@ucsd.edu
TA: Samanvitha
B15 6 seats Wednesday 11AM-12PM, Zoom

We will play with massive text data and try to extract/discover interesting facts and findings without much human annotation effort. Large language models (LLMs) such as GPT-3.5 and GPT-4 will be our best friends to start with! Finding interesting applications of LLMs can be your capstone projects!
Read more
  • About: I’m an Assistant Professor at UCSD jointly appointed by Computer Science and Halıcıoğlu Data Science Institute. I obtained my Ph.D. from UIUC advised by Prof. Jiawei Han in 2019. I received my B.E. from SJTU in 2014. I’m also a coach of UCSD’s ACM-ICPC team. My research focuses on data mining, NLP, and machine learning, with an emphasis on turning unstructured text data into structured knowledge and insights with minimum human effort.
  • Mentoring Style: I will lead the discussion in Fall and give mini-lectures. In Winter, it's more like stand-up meetings with Q&A.
  • Suggested Prerequisites: ML or data mining course experience, plus some prior exposure to LLMs.
  • Summer Tasks: Play with ChatGPT at least. Read some papers related to ChatGPT.
  • Previous Project

NLP/ML Tooling
Zhiting Hu
zhh019@ucsd.edu
TA: Samanvitha
B16 6 seats Tuesday 2-3PM, In-Person 📍 HDSI 138

We'll explore, design, and develop useful machine learning/natural language processing tools for either general ML/NLP research problems (such as reinforcement learning) or specific application domains (such as processing medical text data). We hope to make the tools open-source and hence contribute to the research/industry communities. Given the limited time, we will base our tools on popular infrastructure like Huggingface, AutoGPT, etc.
Read more
  • About: Zhiting Hu is an Assistant Professor in Halicioglu Data Science Institute at UC San Diego. He received his Bachelor's degree in Computer Science from Peking University in 2014, and his Ph.D. in Machine Learning from Carnegie Mellon University in 2020. His research interests lie in the broad area of machine learning, artificial intelligence, natural language processing, and ML systems. In particular, he is interested in principles, methodologies, and systems of training AI agents with all types of experiences (data, symbolic knowledge, rewards, adversaries, lifelong interplay, etc.), and their applications in controllable text generation, healthcare, and other application domains. His research was recognized with a best demo nomination at ACL2019 and an outstanding paper award at ACL2016.
  • Mentoring Style: The mentor will provide a couple of candidate projects; each student group is expected to either pick one of the projects or define their own project by themselves. During the project, the mentor will provide relevant materials (papers, tools, etc.) and necessary guidance when there are questions; students are expected to carry out the projects largely independently. There are also possibilities that students can choose to join a larger research group led by the mentor's PhD students.
  • Suggested Prerequisites: Natural Language Processing
  • Summer Tasks: Get familiar with deep learning tools like PyTorch and NLP tools like Huggingface
  • Previous Project

Developing Open Datasets, Models, Systems, and Evaluation Tools for Large (Language) Models
Hao Zhang
sjtu.haozhang@gmail.com
TA: Trevor
B17 6 seats Thursday 1-2PM, In-Person 📍 HDSI 336

The rapid advancement of large multimodal models has revolutionized AI systems, resulting in unprecedented levels of intelligence as seen in OpenAI’s GPT-4. However, despite its performance, the training and architecture details of GPT-4 remain unclear, hindering research and open-source innovation in this field.
I propose to build an open platform for large multimodal chat assistants, based on our ongoing open-source effort, Vicuna, an impactful language model that has been widely adopted for building generative AI applications. I propose three thrusts to extend Vicuna as an open and unified platform for large multimodal models:
  1. On the system side, I propose an infrastructure for scalable training and high-throughput serving with advanced memory management and parallelization techniques.
  2. On the model side, I aim to build a large multimodal model close to ChatGPT quality, which can also interact with the real world by taking actions and using tools.
  3. On the data and benchmark side, I plan to develop a gamified data collection and benchmark platform with novel data augmentation, data filtering, and ranking methods.
If successful, the proposed platform will provide ML developers with a one-stop experience for training, serving, and evaluating large multimodal models.
Read more
  • About: Hao's research interests are in the intersection of machine learning and systems, with the focus on improving the performance and ease-of-use of today’s distributed ML systems. Hao works actively on democratizing access to large language models (LLMs). Hao has created several popular open source LLM projects, such as Alpa, Vicuna, and FastChat. Hao's previous open-source artifacts in ML systems have been used by organizations such as AI2, Meta, and Google, and parts of Hao's research have been commercialized at multiple start-ups including Petuum and AnyScale.
  • Mentoring Style: Will involve my postdocs or PhD students who are interested in talking. Hands-on in the early bootstrap stage and hands-off afterwards.
  • Suggested Prerequisites: Machine learning
  • Summer Tasks: Work through https://icml.cc/virtual/2022/tutorial/18440 and https://github.com/RUCAIBox/LLMSurvey

NLP Credit Score Development
Brian Duke and Berk Ustun
With questions for industry mentors, email Suraj and berk@ucsd.edu
TA: Trevor
B18 12 seats Friday 11AM-12PM, Zoom Industry Partner: Prism Data

One of the most widely used and least understood parts of the Financial Services industry is the credit score. In this course, students will work with transactional bank data to build statistical models for the purpose of assessing creditworthiness in the financial services industry. The course will take students through the life of a model development project, from data exploration through model training and evaluation. Students will have the opportunity to work with both structured and unstructured data as they learn about the process and attributes that go into credit scores. Additionally, students will learn about the importance of model explainability and fairness.
Read more
  • About:
    • Brian Duke has been a data scientist for 22 years, working primarily in the Financial Services sector. After graduating from UCSD with his BA in Math and MS in Stats, he worked at Capital One, FICO, SAS Institute, Bridgepoint Education, Accenture, Experian, and Petal Card, and he is currently the head of data science at Prism Data. He holds 3 patents, with 11 pending. Brian currently resides in Poway.
    • Berk Ustun's research combines machine learning, optimization, and human-centered design. He develops methods to promote the responsible use of machine learning in medicine, consumer finance, and the physical sciences. In particular, he studies topics like algorithmic fairness, interpretability, and personalization. Berk received a PhD in Electrical Engineering and Computer Science from MIT, and Bachelor's degrees in Operations Research and Economics from UC Berkeley.
  • Mentoring Style: We'll meet every week and use the time to check in on your projects. We will work with you directly, rather than involving students and postdocs.
  • Suggested Prerequisites:
  • Summer Tasks: The course will focus on translating the data in your bank account into a credit score. In preparation for the course, it is recommended that students familiarize themselves with NLP techniques for text classification. We will first build a model to categorize banking transactions, then use the categorized transactions to build a credit score.
  • Previous Project