This page serves as a portfolio of projects that I have done from the early undergraduate days, when I was first exposed to large-scale data analysis and the R language, until the end of my Ph.D. journey. Projects whose data are non-proprietary are open-source and put on GitHub. Respective publications are also listed, if available. All projects are listed in reverse chronological order.
- Law Enforcement Resource Optimization
- Real-time Dwell Time Prediction Using Passive Wi-Fi Data
- Urban Transportation
- Human Mobility Analytics
- Image Classification
- Social Network Analysis
- Modeling Hospital Length of Stay
(Aug 2016 – Jun 2017)
In today’s world of heightened security concerns, there has been an increased pressure on law enforcement agencies around the world to deploy resources as efficiently as possible to respond to crimes and emergent incidents. Such pressure is even more aggravated by the global urbanization trends, resulting in manpower crunches on law enforcement agencies trying meet the rising demands in large and densely populated urban areas. With the help of big spatiotemporal data that provides fine-grained details of crime incidents, it is now possible to design adaptive and data-driven staffing models and predictive policing policies that optimize the allocation of resources and improve responses to incidents. In this project, we are provided with multi-year fine-grained crime data from a national law enforcement agency in order to provide optimal data-driven solutions to the agency.
(May – Jul 2016)
The proliferation of wireless technologies in today’s everyday life is one of the key drivers of the Internet of Things (IoT). In addition, the vast penetration of wireless devices gives rise to a secondary functionality as a means of tracking and localization. In order to connect to known Wi-Fi networks, mobile devices have to scan and broadcast probe requests on all available channels, which can be captured and analyzed in a non-intrusive manner. Thus, one of the key applications of this feature is the ability to track and analyze human behaviors in real-time directly from the patterns observed from their Wi-Fi-enabled devices. In this project, we develop such a system to obtain these Wi-Fi signatures in a completely passive manner coupled with adaptive machine learning techniques to predict in real-time the expected dwell time of the device owners at a specific location. We empirically evaluate the proposed system using real-world Wi-Fi data collected at retail locations that have diverse human mobility patterns. The system was mostly implemented in Python using scikit-learn and Wireshark packages.
This was my summer internship project at IBM Research – Singapore.
(Jun 2015 – Apr 2016)
Reinforcement learning is at the heart of modern AI techniques. It is the underpinning of recent breakthroughs such as AlphaGo and driverless cars. In this project, I apply the main ideas of reinforcement learning called Q-learning to an interesting hypothetical application called Smartcab. Smartcab is a simplistic simulation of a self-driving car from the not-so-distant future that ferries people from one arbitrary location to another. This application demonstrates how Q-learning can be used to train smart smartcabs to desired self-driving behaviors through trials and errors in a well-defined environment and reward structure. The application is developed using Python and the Pygame framework. This project was inspired from an original Udacity coursework, which now offers an entire self-driving car course!
Traffic speed is a key indicator of the efficiency of an urban transportation system. This project addresses the problem of efficient and fine-grained speed prediction using big traffic data obtained from traffic sensors. Gaussian processes (GPs) have been used to model various traffic phenomena; however, GPs do not scale with big data due to their cubic time complexity. We address such efficiency issues by proposing local GPs to learn from and make predictions for correlated subsets of data. The main idea is to quickly group speed variables in both spatial and temporal dimensions into a finite number of clusters, so that future and unobserved traffic speed queries can be heuristically mapped to one of such clusters. A local GP corresponding to that cluster can then be trained on the fly to make predictions in real-time. We call this localization, which is done using non-negative matrix factorization. We additionally leverage on the expressiveness of GP kernel functions to model the road network topology and incorporate side information. Extensive experiments using real-world traffic data show that our proposed method significantly improve both the runtime and prediction performances.
(Jan 2013 – May 2015)
We collected big mobility data of visitors to the resort island of Sentosa in Singapore. The data was collected through bundle packages such as Day Pass and associated RFID-enabled devices. We propose various analytical frameworks to model and predict the visitors’ trajectories. Our methods include revealed preference analysis and reinforcement learning. Applications include designing bundle packages that optimize visitors’ experiences and theme park’s revenue, and recommending optimal plans to the visitors to avoid congestions and reduce queue lengths.
We consider the problem of trajectory prediction, where a trajectory is an ordered sequence of location visits and corresponding timestamps. The problem arises when an agent makes sequential decisions to visit a set of spatial POIs. Each location bears a stochastic utility and the agent has a limited budget to spend. Given the agent’s observed partial trajectory, our goal is to predict the agent’s remaining trajectory. We propose a solution framework to the problem that incorporates both the stochastic utility of each location and the agent’s budget constraint. We first cluster the agents into “types”. Each agent’s trajectory is then transformed into a discrete-state sequence representation. We then use reinforcement learning (RL) to model the underlying decision processes and inverse RL to learn the utility distributions of the locations. We finally propose two decision models to make predictions: one is based on long-term optimal planning of RL and another uses myopic heuristics. We apply the framework to predict real-world human trajectories and are able to explain the underlying processes of the observed actions.
We propose the problem of predicting a bundle of goods, where the goods considered is a set of spatial locations that an agent wishes to visit. This typically arises in the tourism setting where attractions can often be bundled and sold as a package. We look at the problem from an economic point of view. That is, we view an agent’s past trajectories as revealed preference (RP) data, where the choice of locations is a solution to an optimization problem according to some unknown utility function and subject to the prevailing prices and budget constraint. We approximate the prices and budget constraint as the time costs to visit the chosen locations. We leverage on a recent line of work that has established efficient algorithms to recover utility functions from RP data and adapt it to our original setting. We experiment with real-world trajectories and our predictions show significantly improved accuracies compared with the baseline methods.
(Aug – Dec 2014)
We perform the image classification task on the CIFAR-10 dataset, where each image belongs to one of the 10 distinct classes. The classes are mutually exclusive and are mostly objects and animals. The images are small, of uniform size and shape, and are RGB colored. We implement an image preprocessing framework to learn and extract the salient features of the training images. With preprocessing, we see a considerable improvement of more than 15% from the baseline (i.e., without preprocessing). We also experiment with a variety of classifiers on the preprocessed images and find out that a linear SVM performs the best. We finally experiment with ensemble learning by combining a SVM with a multinomial logistic regression, which marginally improves on the linear SVM at a high complexity.
(Jan – May 2013)
We collected data from Stack Overflow for the duration of two years (2011–2012) and constructed a large network of users (i.e., each node is a unique user) that represents the relationships between ‘askers’ and ‘answerers’ on the platform. We call that the network of expertise. We analyzed the network structure and reciprocal patterns among its users. We used network centrality measures, principal component analysis, and linear regression to describe and characterize what makes an expert user in the constructed graph.
(Jan 2008 – May 2009)
We collected a large-scale dataset over four years (2004–2007) of stroke patients admitted to the Singapore General Hospital. The dataset contained the typically long-tailed distribution of hospital length of stay (LOS) as the primary health-care outcome, and other (time-dependent) explanatory variables. We used survival models to analyze the risk factors affecting the patients’ LOS and proposed Coxian phase-type distributions to model the distribution of LOS and analyze its trend over time.
Truc Viet Le, Chee Keong Kwoh, Kheng Hock Lee and Eng Soon Teo. Trend Analysis of Length of Stay Data via Phase-type Models. International Journal of Knowledge Discovery in Bioinformatics (IJKDB). [Read more!]