The Summer Undergraduate Research Program (SURP) enables students to conduct extended, focused research in close cooperation with a Pomona faculty member. Below are recent summer research projects in the Computer Science Department.
Temporal Analysis of Computer Science Bibliography Data
Carlos Salas Ortega ’19; Advisor: Yuqing Wu; Collaborators: Gianna Wu ’19 and Baiyu Li ’19
Temporal information is an integral part of any data, in which data entries are represented as events that have a start and end time. In order to analyze temporal data and discover time-related insights, we used Allen’s interval algebra, which recognizes that for every pair of events, they must satisfy one and only one of 13 relationships, such as overlaps, in which two events partially overlap, and meets, in which one event starts at the time when the other ends. To analyze research collaboration data derived from DBLP, a CS biography data collection, we first created a Java program that analyzes CSV files and for each pairs of events returns the Allen Algebra relationship it satisfies. We also use SQL, a database query language, to manipulate the source data and generate derived data for plotting. We used an open source library called JFreeChart to generate graphs to visually present our findings. Many of the graphs we generated are very insightful. For instance, a heatmap that show the number of collaborations of certain length started in each year between 1930’s and 2017 highlights the recent rapid growth in the field of computer science and collaborations in the field. Besides discovering interesting patterns in the DBLP dataset, we also analyzed the runtime and space complexity of both our Java program and our code in SQL. We found that SQL processed data and produced results more efficiently, but Java offered more flexibility and resilience.
Funding Provided By: Paul K. Richter and Evalyn E. Cook Richter Memorial Fund
CNC Carving Factory: From Tablet to Wood
Sonia Grunwald ’18; Advisor: Michael Greenberg; Collaborator: Edinam Eshietedoho ’20
Over the course of the summer, my research partner and I conducted a project in which we, under the direction of Professor Michael Greenberg, developed an application that allows the user to draw on a screen and then save the image as a series of g-code commands that when run on a Laguna Smartshop II CNC Router will produce a carved version of the image in a piece of wood. The primary focus of this project, initially, was to develop this application to get to the point that it would be possible for any user to pick up the application, connect to the CNC machine, and be able to have one to one communication with the machine. Simply put, anyone would be able to "paint" in wood; however, ultimately this idea was not fully realized, due in large part to the fact that for such a connection to be made between the tablet application and the CNC machine, an encrypted connection must be established on both ends. Unfortunately, the technology that the CNC machine is composed of does not allow for this. Upon this realization the manufacturer was contacted to see if there was any way to get encryption on our specific machine. Our request was, unfortunately, denied. Despite these minor drawbacks to our research, the project and resulting application created is a solid step in the correct direction for the wood "painting" effect we desired. Not only does the application support touch drawing, but it is also capable of receiving external controller inputs such as a XBOX 360 controller.
Funding Provided By: Fletcher Jones Foundation
Intelligent Music Software
Isys Johnson ’20; Advisor: Gabriel Chandler; Collaborators: Cai Glencross ’18, Lukas Gnirke ’18, Oberlin College, Joseph Yaconelli ’20, University of Oregon, Nic Trieu ’18, HMC
Comping is a term to describe how a pianist, guitarist, or other chordal instrument plays chords in rhythm to propel the music forward as well as support the soloist. My goal was to add a more active style of comping to the Impro-Visor software by implementing interpolation of chords and substitution. The first step in our approach to interpolation is deciding whether or not to interpolate. So, after providing an overall probability that the user can customize, I go through the chord progression randomly assign a Boolean that indicates whether or not any interpolation will be added after that chord. The second part is matching possible interpolations to the boundary chords of a positive interpolation indicator. Generalizing chord qualities to match the left and right boundaries of the chord part to the left and right boundaries of the interpolants, we consult a list of weighted interpolants that could possibly be inserted between the original chords of the chord part. Also, we allow for arbitrary left boundary chord interpolates for flexibility. Using the weights, one interpolant is selected randomly for application, and each chord -- the boundaries and interpolant -- have their total durations split between them. Substitution was implemented similarly, but without the generalization of chords and by replacing chords in the progression instead of simply inserting. The generated accompaniments implementing these enhanced comping methods generate nice sounding backing tracks.
Funding Provided By: Dean’s Diversity Fund
Project 1, Improving Human Evaluation for Text Simplification:
Max Shwarzer '18; Mentor: David Kauchak
I examined the methods currently used to evaluate text simplification systems. I determined that the methods used were unreliable and biased, and had, in some instances, led to incorrect conclusions. After gathering data with Amazon Mechanical Turk, I developed a suite of best practices, governing sample sizes, evaluation metrics, controls used and experimental wording to address previous issues.
Project 2, Text Simplification with Deep Learning:
Max Shwarzer '18; Mentor: David Kauchak
Text Simplification with Deep Learning: I developed and evaluated a neural network text simplification system, using a soft-attention recurrent architecture originally developed for machine translation. I optimized it for text simplification, after which it achieved a state-of-the-art BLEU (quality) score for sentence-level text simplification, beating all other tested systems.
Online Text Simplification of Medical Texts
Melissa Grueter '18; Mentor: David Kauchak
Medical texts are often filled with specialized medical terms and other complex words that can render them difficult for the average person to understand. We developed an online text simplification tool designed to help the writers of medical texts make them more accessible. Because the importance of accuracy in medical texts precludes the sort of errors that are inevitable with a fully automated text simplification tool, our tool is designed for use as an editing tool supplemented by the writer’s discretion.
The tool uses word frequency on the web (measured using the Google corpus) as a proxy for word difficulty with less frequent words being viewed as more difficult. It highlights complex (low frequency) words and then suggests a menu of simpler (more frequent) replacements. Replacements were generated from WordNet. Morphological analysis is done automatically to ensure that the replacements represent the same morphological category as the original word.
Improving Lexical Simplification
Charlie Fries '17; Mentor: David Kauchak
Lexical Simplification is the task of simplifying text by substituting complex words or phrases with simpler alternatives. Automated lexical simplification can be divided into to two main subtasks: candidate generation and candidate selection. Candidate generation is the task of creating a list of alternative, simpler words for a given target word. Candidate selection consists of selecting the simplest and most context appropriate word from a given list. In our study we explored methods of improving performance on both subtasks
Previously, candidate simplifications were generated through extracting aligned words from an aligned corpus of English Wikipedia and Simple English Wikipedia. In our study, we included several new sources, such as the Newsela Aligned Corpus, Simple Paraphrase Database, and Glavas’s rule extraction model on Glove. Using the expanded rule set resulted in a 3% increase in accuracy compared to the original rule set.
Our study also included an analysis of the efficacy of our model features. We introduced several new feature groups, including embedding features, parse tree features, and features sourced from the Simple Paraphrase Database. Our results suggest that while most individual features may not make a significant contribution to system performance, each feature group as a whole contributes a nontrivial performance increase. Adding these features resulted in a cumulative 4% increase to accuracy in comparison to the original feature set.
Fluorescence Bioimaging of Organellar Network Evolution
Chinasa Okolo ’18; Mentor: Shannon Quinn (University of Georgia); Collaborator: Pramod Giri (University of Georgia)
Currently there are no bioimaging application models that have the capability to tag organelles as dynamic social networks. Being able to model these patterns as evolving networks with respect to time will yield additional insights not possible with existing bioimaging methods. The analytical framework designed in this project was implemented using open source tools such as ImageJ, MATLAB, and Mahotas, which will help increase the accessibility of these tools and the transparency of future analyses. From experimentation with various Python packages and scientific image processors, numerous images were produced that quantified the number of non-zero pixels in actin channels, the centroids of the cells, and the segmented parts of the cells. Given the limited time frame of ten weeks, there were no evident morphological changes observed between the uninfected and infected slides of the 6 hours and 24 hours time points. Due to the lack of facilities and lower costs, Listeria monocytogenes was used as a model, but to characterize the spatio-temporal evolution of organellar networks, the next step of this project will use cells infected with Mycobacterium tuberculosis (Mtb) and Legionella pneumophila (Lp), respectively. The completed framework will quantify changes in organellar shape, quantity, and spatial distribution over large sequences of Z-stack microscope images and digital videos, which will improve our understanding of how cellular mechanisms respond to their environments.
Funding Provided By: National Science Foundation (University of Georgia)
Improving the User Interface for the Online Grace Web Editor
Kelvin Kim ’17; Mentor: Kim Bruce; Collaborator: Reid Mitchell ‘16
Funding Provided By: Richter
Static Typing and Documentation Generation for the Grace Programming Language
Reid Mitchell ’16; Mentor: Kim Bruce
This summer, I assisted Prof. Kim Bruce in his work on the Grace programming language. Grace is a new language designed to facilitate the teaching of programming concepts to students. It eliminates some of the unnecessary complexities of other languages to reduce the learning curve for those new to computer science. It has been taught in introductory CS courses at Pomona and will be taught again in the fall. One of my projects was to build a documentation generator for Grace. It takes in well-commented Grace code and outputs HTML pages containing documentation of that code, similar to the way Javadoc works for Java. Initially, the compiler ignored comments entirely, so I had to modify the parser to preserve comments in the abstract syntax tree. Then I wrote a program to translate that abstract syntax tree into a set of HTML pages. My other project was to fix the statically-typed dialect of Grace. A language is statically-typed if it checks the types of all objects at compile time for correctness and consistency. Standard Grace does not do this, but static typing is a good teaching tool; thus, this dialect was created to enforce these rules for students. At the start of summer, the dialect did not work for all cases. It's still not perfect, but I have made significant progress. One major improvement was to make the compiler write information about types declared in a module to a file visible to other modules. This allows for type-checking against types from imported modules.
Funding Provided By: Richter
Syntactic tree parsing in lexical simplification tasks
Eli Fessler ’17; Mentor: David Kauchak
Our work examines the task of lexical simplification, which reduces the reading complexity of sentences by replacing certain words with simpler, more accessible variants. Previous work on the project built an initial corpus from corresponding sentence pairs extracted from English and Simple English Wikipedia, which we utilize in this project to learn word simplifications. We first train a feature-based ranking function (using SVMrank—a linear support vector machine) on a set of labeled simplifications collected using Amazon Mechanical Turk. From the collected aligned sentences, we extract simplification rules which include candidate word replacements and word replacement frequency. In applying these rules to simplify different sentences, we modified the system to automatically identify words to be changed in testing data, rather than requiring manual identification of the words to be simplified. We then created and refined simplification rules and predictive language features, e.g. a feature to compare the original and modified sentences' parse (syntax) tree. We found that in the original system, 86.25% of the sentences were changed, with a precision of 0.7615 and an accuracy of 0.6629. With the inclusion of the new feature, we saw the precision and accuracy increase to 0.7702 and 0.6668, respectively, with 85.85% of sentences modified; however, these numbers proved to be non-significant, although slight improvement did demonstrate progress in the right direction.
Funding Provided By: Fletcher Jones
The Impact of Quality Metrics on Communities Detected in Complex Networks
Jennifer Nguyen ’17; Mentor: Tzu-Yi Chen; Collaborator: Israel Gebremariam ‘19
Complex networks can be used to represent interactions between users on Facebook, members of a sports league, or neurons within the human brain. A community within a network is a set of members that are more connected to each other than to other members; communities might represent groups on Facebook or college athletic conferences in the NCAA's Division III. In many applications we are interested in identifying communities within these networks based solely on the interactions observed. The majority of commonly-used algorithms for detecting communities optimizes a metric known as modularity, which compares the density of connections of members within a community to members of other communities. However, other quality metrics such as conductance, coverage, performance, and silhouette index could also be used within those same algorithms. For this study we examined the impact of replacing modularity with coverage in existing implementations of the Louvain and CNM community detection algorithms. Preliminary results suggest that the metric used does not dramatically affect the communities identified.
Funding Provided By: Howard Hughes Medical Institute High Achievement Program (Gebremariam), Paul K. Richter and Evalyn E. Cook Richter Memorial Fund (Nguyen)
Analysis of Performance Portable Graph Algorithm Performance
Joshua Landgraf ’16; Mentor: Jeff Amelang (HMC); Collaborator: Reyna Hulett ’16 (HMC)
With the rise of multicore CPUs, powerful GPUs, and massive supercomputers, parallel computing has become increasingly relevant to both regular consumers and members of the high performance computing community. Unfortunately, many parallel programming languages and libraries result in programs being tied to only one particular type of parallel computing architecture by lacking support for other architectures or by forcing programmers to write their programs in such a way that prevents them from performing well on other architectures. However, Kokkos solves these issues by supporting multiple parallel computing architectures and handling many important details itself. This gives Kokkos the freedom to tailor program implementations to the architecture they will be running on, resulting in good performance across architectures. We tested Kokkos by using it to implement a variety of approaches to solving graph problems on CPUs and GPUs. We then compared the performance of Kokkos against OpenMP, a popular parallel programming language for CPUs, and Cuda, a popular parallel programming language for Nvidia GPUs. We also compared different approaches with each other to see which ones worked best. We found that Kokkos generally performed as well as both OpenMP and Cuda when run on sufficiently large graph problems. However, there were some instances where Kokkos was significantly better and some instances where Kokkos was significantly worse than OpenMP or Cuda.
Parallelization of Gene Sequence Alignment Tools Using SeqDB
Eleanor Swent Cawthon '15; Additional Collaborator(s): Kirsten Fagnan (Lawrence Berkeley National Laboratory); Brian (Bushnell Joint Genome Institute); Mentor(s): Katerina Antypas (Lawrence Berkeley National Lab)
Abstract: Genome researchers around the world use the FASTQ file format to represent genome sequence data. Because the data in a single FASTQ file must be accessed sequentially, this standard has created a bottleneck in the performance of sequence alignment tools. This study examines the extent to which using SeqDB, an alternative to FASTQ based on the Hierarchical Data Format, can improve the performance of the BBMap sequence alignment tool. BBMap, was modified to support input in SeqDB format natively. Throughput of BBMap's format conversion tool was measured when the same read data were given in uncompressed FASTQ, Gzip-compressed FASTQ, and SeqDB formats. The modified version of BBMap processed reads at a rate that increased at a rate of approximately 22.3 reads per millisecond per doubling of the number of threads, up to a maximum of 126 reads per millisecond with eight threads. Average throughput was 300 reads per millisecond for uncompressed FASTQ and 227 reads per millisecond for Gzip-compressed FASTQ. These rates did not vary substantially with the number of threads used. Input file size was not found to be related to SeqDB’s throughput. The results of this investigation suggest that SeqDB has the potential to be a scalable solution to one significant input and output bottleneck, but that additional changes in BBMap will be required in order for SeqDB support to match or exceed the performance of older formats.
Funding Provided by: U.S. Department of Energy Office of Science
Integrating the Grace Programming Language into DrRacket
Richard Yannow '14; Student Collaborator(s): Nicholas Cho (2015); Mentor(s): Kim Bruce
Abstract: The Grace programming language project was started with the intention of developing an object-oriented programming language that would make it easy to teach programming to novices. To this end, we have to provide not only a simple and flexible language amenable to different teaching styles and programming paradigms, but also a robust environment in which novices can learn to program. We decided to take DrRacket, an integrated development environment (IDE) for the Racket programming language, and extend it to support Grace, allowing us to take advantage of its numerous built-in beginner-friendly features through Racket’s language-binding capabilities. In order for DrRacket to understand Grace code, we wrote a parser that takes Grace code, or the surface representation, and interprets it to build an Abstract Syntax Tree (AST), or the underlying syntax; a typechecker that builds a type environment and supports the typechecking of any combination of statically-typed and dynamically-typed code; and an interpreter that translates the semantics of the AST into Racket code, so that DrRacket can evaluate it as it would any other program.
Funding provided by Pomona College SURP
Impro-Visor: Audio Input and Style Recognition
Anna Turner '15; Student Collaborator(s): Hayden Blauzvern (2016 HMC); Nate Tarrh (2014 Tufts University); Kelly Lee (2016 HMC); Mentor(s): Robert Keller (HMC)
Abstract: We present research on Impro-Visor, intelligent music software dedicated to helping both beginner and expert jazz musicians improve their playing. We first introduce the creation of audio input capabilities. We accomplish this through SuperCollider, a programming language used for audio synthesis. We use pitch and onset detection to detect notes and rests, which we then send as MIDI input to Impro-Visor. We integrate existing external software to make this feature as portable as possible. We also describe an automated method for style recognition of jazz melodies through the use of supervised training. We train a neural network to recognize defining stylistic elements of specific musicians. We then present melodies to a critic for judgment on a grading scale, and for prediction of the musicians to whom the melodies sound most similar.
Funding Provided by: National Science Foundation (HMC)
Walking in place using the Microsoft Kinect to explore a large VE
Kevin Nguyen '16; Student Collaborator(s): Preston Tunnell Wilson (2016 Rhodes College); Additional Collaborator(s): Kyle Dempsey (Mississippi University for Women); Mentor(s): Betsy Williams-Sanders (Rhodes College)
Abstract: One way to permit free exploration of any sized virtual environment (VE) and provide some of the inertial cues of walking is to have users “walk in place” (WIP) [Williams et al. 2011]. With WIP, each step is treated as a virtual translation even though the participant remains in the same location. In our prior work [Williams et al. 2011], we had success in implementing a WIP method using an inexpensive Nintendo Wii Balance Board. We showed that participants’ spatial orientation was the same as normal walking and superior to joystick navigation. There were two major drawbacks to this WIP algorithm. First, our step detection algorithm had a half–step lag. Second, participants found it slightly annoying to walk in place on the small board. Thus, the current work seeks to overcome these limitations by implementing an algorithm to WIP using two Microsoft Kinect sensors (150 USD each). Specifically, we are interested in seeing how well users can explore a large VE by WIP with the Kinect (WIP–K). Due to the large size of the VE, comparing these results to normal physical walking is not possible. Therefore, we directly compare WIP–K to joystick navigation. Also, we examine scaling the translational gain of WIP–K so that one “step” carries the user forward two steps (referred to as WIP–K x 2). Thus, this within–subject experiment compares subjects’ spatial orientation as they navigate a VE in three conditions: Joystick, WIP–K, WIP–K x 2.
Funding provided by Pomona College SURP
Examining graph clustering stability
Evan Fields '13; Mentor(s): Tzu-Yi Chen
Abstract: A graph is a collection of objects (nodes) and connections between pairs of objects (edges). Graphs have been used to study phenomena such as social networks, where the nodes represent people and edges represent friendships. Partitioning nodes into “clusters” such that two nodes within a cluster are more likely to be connected than two nodes from separate clusters can reveal structures such as communities in social networks. Many clustering algorithms have been proposed. However, popular measures of clustering strength are computationally unfeasible to maximize and algorithms can, at best, compute reasonably good clusterings. More importantly, clustering algorithms will return a partition of the nodes even for graphs lacking community structure. We investigate how to determine whether the returned clusters represent a meaningful structure in the graph. We present an algorithm-agnostic method of examining stability (and thus meaningfulness of clusters): after running a clustering algorithm on the initial graph, we add a single edge not present in the original graph and re-run the clustering algorithm. The distance between the clustering on the original and on the modified graphs is recorded, and this experiment is repeated for a large number of edges not present in the original graph. We hypothesize that the distribution of recorded distances carries information about stability and present experimental data collected on a variety of synthetic and real-world graphs.
Anonymity in Online Communities
Eli Omernick '14; Mentor(s): Sara Sood
Abstract: With the ever-expanding scope of computer-mediated communication, especially in this age of social media and instant communication, there are some interesting and meaningful questions being raised on how we communicate and the subsequent implications; specifically we have investigated the influence of anonymity on the behavior of Internet users. TechCrunch is a technology news site, which posts articles and allows readers to post comments in response. On March 1st, 2011, TechCrunch switched from the Disqus commenting platform to the Facebook commenting platform, marking the end of condoned anonymity in their online community. They did this in the name of “Troll Slaying,” or the attempt of reducing intentionally negative or destructive user contributions. We looked at trends between the two corpora as wholes as well as between the user groups (which we characterize as having varying degrees of anonymity, from totally anonymous, to a pseudonym, to using users’ real names) within the individual corpora. We evaluated comments in terms of several qualitative (e.g. Readability, Relevance, Word Usage) as well as quantitative (e.g. Comments/Article, Comments/User, Comment Length) metrics.
Programming in Grace
Amy Ruskin '14; Student Collaborator(s): Richard Yannow (2014); Mentor(s): Kim Bruce
Abstract: Grace is a programming language that is currently in development with the eventual goal of being used to teach introductory computer science courses. I programmed extensively in Grace to find some remaining bugs and provide feedback on the experience of actually using the language. In the end, I produced Grace code for various data structures, using Java structures as guides, and translated some of the projects and assignments from Pomona's Data Structures and Advanced Programming (CS62) into Grace. Most of the problems encountered were due to features of the language that were not yet fully implemented and the lack of extensive and current documentation, but once those issues are resolved, Grace should be easy to learn and straightforward to write.
Funding Provided by: Pomona College SURP
Adapting Object-Oriented Languages for Instructional IDEs
Richard Yannow '14; Mentor(s): Kim Bruce
Abstract: The Grace programming language project was started with the intention of making a new object-oriented language for teaching the practice of programming. In order to be successful, Grace must be easily usable by novices, and a significant factor towards that goal is having a beginner-friendly integrated development environment (or IDE). We decided to use DrRacket as an IDE for Grace, allowing us to take advantage of its numerous novice-friendly features and Racket's powerful language-building capabilities. We developed a new backend language, Racket-Grace, with Grace's semantics, but with Racket-style syntax. We also wrote a pretty-printer that will take processed abstract syntax trees of Grace code, and return an equivalent Racket-Grace program. This will allow us to input a Grace program into DrRacket and run it there, translating it into Racket-Grace as an under-the-hood intermediate step, allowing us to maintain compatibility with DrRacket's many useful tools.
Funding provided by Pomona College SURP