Information Retrieval and Web Search Engine
A search engine hosted on Amazon Web Services that answers queries about polar data. It was built as a class project at USC.
1. We used Apache Nutch to crawl data from three NASA sites: AMD, ACADIS and ADE. We developed algorithms to detect exact and near duplicates among the crawled documents and built Apache Nutch plugins based on those algorithms. We also used Apache Tika to extract metadata features from the documents and Selenium to crawl the deep web.
2. We developed content-based and link-analysis algorithms similar to PageRank and HITS to index data into Apache Solr. Metadata features, such as locations mentioned in a document (extracted using CLAVIN) and date and time information, were used when indexing the documents.
3. We used libraries like D3, Banana dashboard and FacetView to allow users to look for information in the data indexed on Apache Solr.
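To illustrate the near-duplicate detection mentioned in step 1, here is a minimal Python sketch of one common approach: comparing documents by the Jaccard similarity of their word shingles. This is not the project's actual Nutch plugin; the function names, the shingle size `k` and the `threshold` value are assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.8):
    """Flag two documents as near duplicates when shingle overlap is high.

    The 0.8 threshold is an illustrative assumption, not a tuned value.
    """
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold
```

Exact duplicates score 1.0 under this measure, so the same check covers both cases. At crawl scale, pairwise comparison is usually replaced by a hashing scheme such as MinHash or SimHash.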
Here’s a video about our search engine.
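For readers unfamiliar with the link analysis mentioned in step 2, the following is a small Python sketch of textbook PageRank computed by power iteration. It is not the variant we implemented for the project; the `links` dictionary format, the damping factor of 0.85 and the fixed iteration count are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict of page -> score; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    if not pages:
        return {}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank
```

Scores like these can be stored as a field on each document and used to boost results at query time.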
A problem-solving intelligent agent written in C++ that plays the game Reversi. I designed the heuristics for the game-playing agent and implemented well-known algorithms such as minimax and iterative deepening with alpha-beta pruning, which let the agent generate a game tree and decide the best move at each point in the game. The agent reached the quarterfinals of a class tournament with about 300 participants. More information about the agent can be found here.
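The search technique named above can be sketched in a few lines. The example below is in Python rather than the agent's C++, and it is a generic illustration, not the agent's code: the `children` and `evaluate` callbacks are hypothetical stand-ins for Reversi move generation and the heuristic function.

```python
import math


def alphabeta(state, depth, alpha, beta, maximizing, children, evaluate):
    """Depth-limited minimax with alpha-beta pruning."""
    kids = children(state)
    if depth == 0 or not kids:
        return evaluate(state)
    if maximizing:
        best = -math.inf
        for child in kids:
            best = max(best, alphabeta(child, depth - 1, alpha, beta,
                                       False, children, evaluate))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizing player will never allow this branch
        return best
    best = math.inf
    for child in kids:
        best = min(best, alphabeta(child, depth - 1, alpha, beta,
                                   True, children, evaluate))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizing player already has a better option
    return best


def best_move(state, max_depth, children, evaluate):
    """Iterative deepening: search depth 1, 2, ... and keep the deepest answer.

    Returns the index of the chosen child, or None if there are no moves.
    """
    choice = None
    for depth in range(1, max_depth + 1):
        best_value = -math.inf
        for index, child in enumerate(children(state)):
            value = alphabeta(child, depth - 1, -math.inf, math.inf,
                              False, children, evaluate)
            if value > best_value:
                best_value, choice = value, index
    return choice
```

With a toy tree of nested lists (leaves are scores), `best_move([[2, 9], [3, 5]], 2, ...)` picks the second branch, because its worst case (3) beats the first branch's worst case (2). In a timed game, the deepening loop would stop when the clock runs out and return the move from the last completed depth.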
Quantifying Surprise in Text
According to Wikipedia, surprise is a brief mental and physiological state, a startle response, experienced by humans as the result of an unexpected event. Therefore, a world that is purely deterministic or predictable in real time for a given observer contains no surprises. In this project, we try to quantify the surprise a user experiences when reading tweets and, based on this ‘wow’ factor, predict whether a tweet is going to trend on Twitter. Check out the paper here.
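One standard information-theoretic way to put a number on surprise is Shannon surprisal: the less probable a word is under a background model, the more bits of surprise it carries. The Python sketch below uses a unigram model with add-one smoothing. It is only an illustration of the idea and is not necessarily the measure used in the paper; the function names and the smoothing choice are assumptions.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Count word frequencies in a background corpus."""
    counts = Counter(corpus_tokens)
    return counts, sum(counts.values())


def surprisal(tokens, counts, total):
    """Mean surprisal of a text in bits per token.

    Uses add-one smoothing with one extra vocabulary slot for unseen
    words, so unknown words get a small but non-zero probability.
    """
    if not tokens:
        return 0.0
    vocab = len(counts) + 1
    bits = 0.0
    for token in tokens:
        p = (counts.get(token, 0) + 1) / (total + vocab)
        bits += -math.log2(p)
    return bits / len(tokens)
```

A tweet made of words that are rare in the background corpus scores higher than one made of common words, which gives a simple feature for a trend classifier.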
Online Programming Contest Judge
A scalable framework written in Python using Django that accepts code written by users, compiles and executes their programs, and determines the validity of the programs by checking them against a set of test cases. Along with the front-end development, which involved working on the admin site and the user site, I also worked on the algorithms that handle grading of the users and on other APIs that improve the scalability of the framework. The code for the judge can be found here.
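The core check a judge performs, running a submission against test cases and comparing its output, can be sketched as below. This is a simplified illustration and not the framework's code: the verdict strings, the `judge` function and `SAMPLE_COMMAND` are hypothetical, and the sketch does no sandboxing, which a real judge needs before running untrusted code.

```python
import subprocess
import sys

# Hypothetical "submission": a one-line program that sums the integers on stdin.
SAMPLE_COMMAND = [sys.executable, "-c", "print(sum(map(int, input().split())))"]


def judge(command, test_cases, time_limit=2.0):
    """Run command once per test case and return a verdict for each.

    test_cases is a list of (stdin_text, expected_stdout) pairs.
    Output is compared token by token, so trailing whitespace is ignored.
    """
    verdicts = []
    for stdin_text, expected in test_cases:
        try:
            result = subprocess.run(
                command,
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            verdicts.append("TIME_LIMIT_EXCEEDED")
            continue
        if result.returncode != 0:
            verdicts.append("RUNTIME_ERROR")
        elif result.stdout.split() == expected.split():
            verdicts.append("ACCEPTED")
        else:
            verdicts.append("WRONG_ANSWER")
    return verdicts
```

For compiled languages, a compile step would run first and report a compilation error before any test case is executed.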
An Android application that uses zillow.com’s REST APIs to search for properties in the United States. I developed the entire app as a class project and also added features like sharing property information with friends on Facebook. More information about the app can be found here.
Topic segmentation is the process of dividing large chunks of text into their component topics. It is usually the first step in a variety of other tasks, such as automatic text summarization and automatic text labeling. I used Python to implement the TextTiling algorithm designed by Marti Hearst. The algorithm uses information about the frequency of words and their distribution to determine blocks of cohesive information. I modified the algorithm to perform more accurate segmentation by using structural cues present in the document. I also included a word-stemming routine, which the original algorithm leaves out, and this led to better performance. The code can be found here.
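The heart of the block-comparison idea is sketched below: score each gap between sentences by the lexical similarity of the text on either side, then look for deep valleys in that score. This is a simplified sketch, not my implementation. It compares whole sentences instead of fixed-length token sequences, skips stemming and stop-word removal, and takes the depth `cutoff` as a caller-supplied parameter instead of deriving it from the score distribution.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def gap_scores(sentences, block_size=2):
    """Similarity of the blocks before and after each sentence gap."""
    tokens = [s.lower().split() for s in sentences]
    scores = []
    for gap in range(1, len(tokens)):
        left = Counter(w for s in tokens[max(0, gap - block_size):gap] for w in s)
        right = Counter(w for s in tokens[gap:gap + block_size] for w in s)
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """How far each gap sits below the nearest peaks on both sides."""
    depths = []
    for i, score in enumerate(scores):
        left_peak = score
        for j in range(i - 1, -1, -1):   # climb left while scores rise
            if scores[j] < left_peak:
                break
            left_peak = scores[j]
        right_peak = score
        for j in range(i + 1, len(scores)):  # climb right while scores rise
            if scores[j] < right_peak:
                break
            right_peak = scores[j]
        depths.append((left_peak - score) + (right_peak - score))
    return depths


def boundaries(sentences, block_size=2, cutoff=1.0):
    """Indices of sentences that start a new topic segment."""
    depths = depth_scores(gap_scores(sentences, block_size))
    return [i + 1 for i, d in enumerate(depths) if d > cutoff]
```

On six short sentences where the first three share one vocabulary and the last three share another, the only gap with zero similarity is the one before the fourth sentence, so `boundaries` returns `[3]`.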
Optical character recognition has applications in areas like zip code identification and automatic reading of bank cheques. I wrote an Octave program that uses a neural network trained with the backpropagation algorithm to identify handwritten digits. I implemented the solution as part of the online machine learning course taught by Professor Andrew Ng. The program identifies the Arabic digits with around 95% accuracy. The code can be found here.
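For reference, the backpropagation step for a network with one hidden layer can be written compactly. The sketch below is in plain Python rather than Octave and is not the course solution: it handles a single training example, uses sigmoid units with a cross-entropy loss and no regularization, and all names are chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Return hidden and output activations for one input vector."""
    a1 = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    a2 = [sigmoid(sum(w * ai for w, ai in zip(row, a1)) + b)
          for row, b in zip(W2, b2)]
    return a1, a2


def loss(x, y, W1, b1, W2, b2):
    """Cross-entropy loss summed over the output units."""
    _, a2 = forward(x, W1, b1, W2, b2)
    return -sum(t * math.log(a) + (1 - t) * math.log(1 - a)
                for t, a in zip(y, a2))


def backprop(x, y, W1, b1, W2, b2):
    """Gradients of the loss with respect to W1, b1, W2 and b2."""
    a1, a2 = forward(x, W1, b1, W2, b2)
    # Output error: for sigmoid units with cross-entropy this is a2 - y.
    delta2 = [a - t for a, t in zip(a2, y)]
    # Hidden error: push delta2 back through W2, scale by sigmoid'.
    delta1 = [sum(W2[k][j] * delta2[k] for k in range(len(W2)))
              * a1[j] * (1 - a1[j])
              for j in range(len(a1))]
    grad_W2 = [[d * a for a in a1] for d in delta2]
    grad_W1 = [[d * xi for xi in x] for d in delta1]
    return grad_W1, delta1, grad_W2, delta2
```

A useful sanity check, also used in the course, is to compare these analytic gradients with numerical ones obtained by nudging a single weight up and down by a small amount. Training then repeats the gradient computation over all examples and moves the weights against the gradient.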