Northeastern University
Free Event

Title: Sparsity Against Exponential Complexity: Parallel Implementation of Separate Testing of Inputs and Authorship Attribution via Sparse Stochastic Context Trees Modeling

Speaker: Paul Grosu, MS Candidate, College of Computer and Information Science at Northeastern University

Location: Northeastern University, 440 Huntington Avenue, West Village H, 3rd Floor, Room #366, Boston, Massachusetts 02115

Abstract

Many problems of interest can be modeled as follows: we have a large number of features (inputs) to which an unknown function is applied, producing an output, where only a small but unknown subset of the features influences the output. Our goal is to identify the "active" input features that influence the output. For example, consider medical diagnostics: we may have a disease of interest and hundreds of possible input features (symptoms, genetic markers, blood test results, etc.), and our goal is to identify the subset of features that indicate or influence the presence or absence of the disease.

The first goal is to identify significant sparse features in data by applying information-theoretic methods, implemented with parallel programming for speed. Response Surface Methodology (RSM), proposed by Sir Ronald Fisher, concerns the optimization of a response variable that is influenced by several variables; over roughly 100 years it has made only partial progress using a combinatorial approach. We have expanded the RSM approach to assume sparsity, in which only a few variables of many have an effect on the outcome of an experiment. By using a random sample of points from a Complete Factorial Design and Separate Testing of Inputs based on Empirical Shannon Information (ESI), our method dramatically outperforms the extremely popular LASSO in both the required sample size and the processing time (via a parallel implementation).
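To make the idea of testing each input separately more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method presented in the talk: the plug-in mutual information estimate stands in for the ESI statistic, and the toy data, the function names `empirical_mutual_information` and `rank_inputs`, and the choice of active inputs are all hypothetical.

```python
import math
import random
from collections import Counter


def empirical_mutual_information(xs, ys):
    """Plug-in estimate (in bits) of I(X; Y) from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # (c/n) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_inputs(design, outputs):
    """Score every input column on its own and sort by decreasing score."""
    n_inputs = len(design[0])
    scores = [
        (empirical_mutual_information([row[j] for row in design], outputs), j)
        for j in range(n_inputs)
    ]
    return sorted(scores, reverse=True)


# Hypothetical toy experiment: 20 binary inputs, 400 random design points,
# and an output that depends only on inputs 3 and 11 (logical AND).
rng = random.Random(0)
n_inputs, n_samples = 20, 400
design = [[rng.randint(0, 1) for _ in range(n_inputs)] for _ in range(n_samples)]
outputs = [row[3] & row[11] for row in design]

# The two highest-scoring inputs are taken as the estimated active set.
top_two = sorted(j for _, j in rank_inputs(design, outputs)[:2])
print(top_two)
```

Because each input is scored independently of the others, the per-input computations can be distributed across workers without coordination, which is the property a parallel implementation would exploit.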

The second goal is to further modify and extend ESI for authorship attribution. Literary corpora have a sparse set of contexts, which can be modeled as a Stochastic Context Tree (SCOT). SCOTs build on Jorma Rissanen's fundamental Universal Data Compression System approach, which is closer to optimal than the popular Ziv-Lempel algorithm. Training SCOTs in parallel on a cluster of several hundred computers enables efficient microstyle authorship attribution of literary texts, and identifies the contexts that contribute most to discriminating between the styles of different authors via Homogeneity Testing and Follow-up Analysis.
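The following Python sketch illustrates the general idea of attributing a text by context-based likelihood. It is a simplified stand-in and not the SCOT training or homogeneity test from the talk: it assumes character-level contexts of a fixed maximum depth, add-one smoothing, and tiny made-up "author" strings. The names `train_contexts` and `log_likelihood` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_contexts(text, max_depth=3):
    """Count next-symbol frequencies after every context of length 0..max_depth."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(text):
        for d in range(min(max_depth, i) + 1):
            counts[text[i - d:i]][symbol] += 1
    return counts


def log_likelihood(text, counts, alphabet, max_depth=3):
    """Log2-likelihood of text, using the longest context seen in training."""
    total = 0.0
    for i, symbol in enumerate(text):
        seen = Counter()
        # Back off from the longest available context to shorter ones.
        for d in range(min(max_depth, i), -1, -1):
            context = text[i - d:i]
            if context in counts:
                seen = counts[context]
                break
        # Add-one smoothing keeps unseen symbols from getting zero probability.
        prob = (seen[symbol] + 1) / (sum(seen.values()) + len(alphabet))
        total += math.log2(prob)
    return total


# Hypothetical training "corpora" for two authors and a query fragment.
corpora = {"A": "ab" * 10, "B": "aabb" * 5}
query = "babababa"
alphabet = set("".join(corpora.values()) + query)

models = {name: train_contexts(text) for name, text in corpora.items()}
scores = {name: log_likelihood(query, model, alphabet) for name, model in models.items()}
best_author = max(scores, key=scores.get)
print(best_author, scores)
```

Each author's model is trained independently of the others, so training is a natural candidate for distribution across machines. Selecting a sparse set of contexts and testing which contexts discriminate between authors are beyond what this sketch shows.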

Committee 

Javed Aslam, Associate Dean of Faculty and Research, College of Computer and Information Science at Northeastern University
Mikhail Malioutov, Professor, College of Science at Northeastern University
Virgil Pavlu, Associate Research Scientist
