Browsing by Author "Merényi, Erzsébet"
Now showing 1 - 6 of 6
Item
Classification of hyperspectral imagery with neural networks: comparison to conventional tools (Springer, 2014)
Merényi, Erzsébet; Farrand, William H.; Taranik, James V.; Minor, Timothy B.
Efficient exploitation of hyperspectral imagery is of great importance in remote sensing. Artificial intelligence approaches have been receiving favorable reviews for classification of hyperspectral data because the complexity of such data challenges the limitations of many conventional methods. Artificial neural networks (ANNs) have been shown to outperform traditional classifiers in many situations. However, studies that use the full spectral dimensionality of hyperspectral images to classify a large number of surface covers are scarce, if not non-existent. We advocate the need for methods that can handle the full dimensionality and a large number of classes, to retain the discovery potential and the ability to discriminate classes with subtle spectral differences. We demonstrate that such a method exists in the family of ANNs. We compare the maximum likelihood, Mahalanobis distance, minimum distance, spectral angle mapper, and a hybrid ANN classifier on real hyperspectral AVIRIS data, using the full spectral resolution to map 23 cover types with a small training set. Rigorous evaluation of the classification accuracies shows that the ANN outperforms the other methods and achieves ~90% accuracy on test data.

Item
Large Scale Online Aggregation Via Distributed Systems (2014-12-04)
Pansare, Niketan; Jermaine, Chris; Nakhleh, Luay; Merényi, Erzsébet
From movie recommendations to fraud detection to personalized health care, there is a growing need to analyze huge amounts of data quickly. To deal with huge amounts of data, many analysts use MapReduce, a software framework that parallelizes computations across a compute cluster. However, due to the sheer volume of data, MapReduce is sometimes still not fast enough to perform complicated analyses.
In this thesis, I address this problem by developing a statistical estimation framework on top of MapReduce to provide for interactive data analysis. I present three projects I have worked on under this topic. In the first project, I consider extending Online Aggregation (OLA) to a MapReduce environment. OLA allows the user to compute an arbitrary aggregation function over a data set and output probabilistic bounds on accuracy in an online fashion. OLA in a relational database system uses classical sampling theory to estimate confidence bounds. The key difference in a large-scale distributed computing environment is the importance of block-based processing. At a random time instance, the system is likely to be in the middle of blocks that take longer to process, so such blocks are less likely to have been taken into account when an estimate is generated. Since one might expect correlation between the processing time and the aggregated value of a block, the estimates of the aggregate can be biased. To address this inspection paradox, I propose a Bayesian model that utilizes a joint prior over the values to be aggregated and the time taken to process/schedule each block. Since the model takes timing information into account, the bias is removed. This model is implemented on Hyracks, an open-source project similar to Hadoop, the most popular implementation of MapReduce. In the second project, I consider implementing gradient descent on top of MapReduce. Gradient descent is an optimization algorithm that finds a local minimum of a function $L(w)$ by starting with an initial point $w_0$ and then taking steps in the direction of the negative gradient of the function to be optimized. The computation of the gradient is referred to as an epoch, and gradient descent computes many epochs iteratively until completion. If the number of data points $N$ is very large, it can take a lot of time to compute the aggregate for an epoch $k$.
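The gradient descent iteration described above can be sketched in a few lines. This is a generic illustration, not the thesis implementation; the function names and the toy objective are made up for the example.

```python
import numpy as np

def gradient_descent(grad_L, w0, step=0.1, epochs=100):
    """Start at w0 and repeatedly step against the gradient of L."""
    w = np.asarray(w0, dtype=float)
    for _ in range(epochs):       # each gradient evaluation is one epoch
        w = w - step * grad_L(w)  # move in the direction of the negative gradient
    return w

# Toy objective: L(w) = ||w||^2, whose gradient is 2w; the minimum is at w = 0.
w_min = gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0])
```

In the MapReduce setting discussed in the thesis, the expensive part is the call to `grad_L`, which is an aggregate over all $N$ data points; that is what OLA can estimate early.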
Since the gradient descent algorithm is essentially a user-defined aggregate function, the OLA framework developed in the first part of my thesis can be used to speed up this algorithm in a MapReduce framework. The key technical question that must be answered is "When do we stop the OLA estimation for a given epoch?" In this thesis, I propose and evaluate a new statistical model for addressing this question. Finally, I design, implement, and evaluate a particular machine learning algorithm. An extremely popular feature selection methodology is topic modeling. A topic is defined as a probability distribution over sets of words or phrases, and each document in the corpus is drawn from a mixture of these topics. A topic model for a corpus specifies the set of topics, as well as the proportion in which they are present in any given document. The recent interest in topic models has been driven by the explosion of electronic, text-based data that are available for analysis. From web pages to emails to microblogs, text-based data are everywhere. However, not all electronically available natural language corpora are text-based. In my thesis, I consider the problem of learning topic models over spoken language. My work is motivated by our involvement with the Spoken Web (also called the World Wide Telecomm Web), which allows users in rural India to post farming-related questions and responses to an audio forum using mobile phones. I propose a new topic model that leverages the statistical algorithms used in most modern speech-to-text software. I develop an alternative version of the popular LDA topic model, called the spoken topic model, or STM for short.
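The generative view of topic models described above can be illustrated with a toy sketch: each word in a document is produced by first sampling a topic from the document's mixture, then sampling a word from that topic's distribution. The two topics and the tiny vocabulary below are invented for the example.

```python
import random

# Hypothetical topics: each is a probability distribution over words.
topics = {
    "farming": {"crop": 0.5, "rain": 0.3, "seed": 0.2},
    "market":  {"price": 0.6, "sell": 0.3, "crop": 0.1},
}

def sample_document(topic_mixture, length, seed=0):
    """Generate a document: pick a topic per word, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        name = rng.choices(list(topic_mixture), weights=list(topic_mixture.values()))[0]
        dist = topics[name]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words

# A document that is 70% about farming and 30% about the market.
doc = sample_document({"farming": 0.7, "market": 0.3}, length=10)
```

Inference runs this process in reverse, recovering topics and mixtures from observed words; the STM additionally treats the words themselves as uncertain speech-to-text output.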
This model uses a Bayesian interpretation of the output of speech-to-text software that takes into account the software's explicit uncertainty description (the phrases and weights) in a principled fashion.

Item
Self-Organizing Maps for Segmentation of fMRI: Understanding the Genesis of Willed Movement Through a Multiple Subject Study (2018-04-20)
O'Driscoll, Patrick; Merényi, Erzsébet
The neural process of executing willed movements, though fundamental to human activity, is not well understood. Analyzing the neural activity of human subjects performing willed movement, encoded in functional Magnetic Resonance Imaging (fMRI) data, may further our understanding of the genesis of willed movement. fMRI encodes neural activity in the Blood Oxygen Level Dependent (BOLD) signal in each voxel (volume element) of the fMRI data. Model-free, data-driven methods that cluster voxels based on neural activity have become an increasingly common approach for generating interpretable brain maps. The hypothesis is that grouping voxels based upon the similarity of their time courses should group them into functional regions of the brain. Furthermore, by examining the relationships between clusters, the relationships between brain regions can be revealed. Clustering is an ill-posed problem with many solutions, and finding interpretable or good results is notoriously difficult. Self-Organizing Maps (SOMs) are adept at learning the data structure and provide a means of visualizing and interpreting large, complex, high-dimensional datasets. SOMs learn the data manifold by placing prototypes into the data space to represent nearby data points. By clustering or grouping the SOM prototypes, one clusters the data represented by these prototypes. Currently, the most powerful SOM clustering methods rely on interactive visualizations and user expertise for cluster extraction.
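The SOM training described above, in which prototypes are pulled into the data space to represent nearby data points, can be sketched with the classic update rule. This is a generic illustration of SOM learning, not the dissertation's implementation; the grid size, learning rate, and neighborhood width are arbitrary choices for the example.

```python
import numpy as np

def som_train(data, grid=(5, 5), epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Fit a small SOM: prototypes on a 2-D grid move toward nearby data points."""
    rng = np.random.default_rng(seed)
    protos = rng.standard_normal((grid[0] * grid[1], data.shape[1]))
    # Grid coordinates of each prototype, used by the neighborhood function.
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])])
    for _ in range(epochs):
        for x in rng.permutation(data):
            bmu = np.argmin(((protos - x) ** 2).sum(axis=1))  # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)    # grid distance to BMU
            h = np.exp(-d2 / (2 * sigma ** 2))                # neighborhood weights
            protos += lr * h[:, None] * (x - protos)          # pull prototypes toward x
    return protos

protos = som_train(np.random.default_rng(1).random((100, 3)))
```

After training, the prototypes approximate the data manifold; clustering those prototypes (interactively, or automatically as with the CCSA measure proposed in this work) clusters the underlying voxels.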
However, solutions that require less user time, increase reproducibility, and avoid user bias and subjectivity are desirable. This work proposes a novel Combined Connectivity and Spatial Adjacency (CCSA) measure to be used in a Hierarchical Agglomerative Clustering (HAC) method to automatically extract clusters from SOMs. The CCSA measure uses information available to the SOM in three distinct, latent data spaces. Comparing HAC with CCSA to the most popular automated SOM clustering algorithms on synthetic datasets shows improved performance. HAC with CCSA extracts equivalent or better clusters than the interactive method in a real fMRI multiple-subject study of the genesis of willed movement. By examining the consistency across all subjects and by comparing the autocorrelation between the clusters extracted from each subject's fMRI data, a medical model and the relationships between the brain regions can be understood.

Item
The Dark Machines Anomaly Score Challenge: Benchmark Data and Model Independent Event Classification for the Large Hadron Collider (SciPost Foundation, 2022)
Aarrestad, Thea; van Beekveld, Melissa; Bona, Marcella; Boveia, Antonio; Caron, Sascha; Davies, Joe; de Simone, Andrea; Doglioni, Caterina; Duarte, Javier; Farbin, Amir; Gupta, Honey; Hendriks, Luc; Heinrich, Lukas A.; Howarth, James; Jawahar, Pratik; Jueid, Adil; Lastow, Jessica; Leinweber, Adam; Mamuzic, Judita; Merényi, Erzsébet; Morandini, Alessandro; Moskvitina, Polina; Nellist, Clara; Ngadiuba, Jennifer; Ostdiek, Bryan; Pierini, Maurizio; Ravina, Baptiste; Ruiz de Austri, Roberto; Sekmen, Sezen; Touranakou, Mary; Vaškeviciute, Marija; Vilalta, Ricardo; Vlimant, Jean-Roch; Verheyen, Rob; White, Martin; Wulff, Eric; Wallin, Erik; Wozniak, Kinga A.; Zhang, Zhongyi
We describe the outcome of a data challenge conducted as part of the Dark Machines Initiative and the Les Houches 2019 workshop on Physics at TeV colliders.
The challenge aims at detecting signals of new physics at the LHC using unsupervised machine learning algorithms. First, we propose how an anomaly score could be implemented to define model-independent signal regions in LHC searches. We define and describe a large benchmark dataset, consisting of more than 1 billion simulated LHC events corresponding to 10 fb⁻¹ of proton-proton collisions at a center-of-mass energy of 13 TeV. We then review a wide range of anomaly detection and density estimation algorithms, developed in the context of the data challenge, and we measure their performance in a set of realistic analysis environments. We draw a number of useful conclusions that will aid the development of unsupervised new physics searches during the third run of the LHC, and provide our benchmark dataset for future studies at https://www.phenoMLdata.org. Code to reproduce the analysis is provided at https://github.com/bostdiek/DarkMachines-UnsupervisedChallenge.

Item
The Platform-Aware Compilation Environment: Preliminary Design Document (2010-09-15)
Cooper, Keith D.; Mellor-Crummey, John; Merényi, Erzsébet; Sadayappan, P.; Sarkar, Vivek; Torczon, Linda; Burke, Michael G.
The Platform-Aware Compilation Environment (PACE) is an ambitious attempt to construct a portable compiler that produces code capable of achieving high levels of performance on new architectures.
The key strategies in PACE are the design and development of an optimizer and runtime system that are parameterized by system characteristics, the automatic measurement of those characteristics, the extensive use of measured performance data to help drive optimization, and the use of machine learning to improve the long-term effectiveness of the compiler and runtime system.

Item
The Platform-Aware Compilation Environment: Status and Future Directions (2012-06-13)
Cooper, Keith D.; Khan, Rishi; Lele, Sanjiva; Mellor-Crummey, John; Merényi, Erzsébet; Palem, Krishna; Sadayappan, P.; Sarkar, Vivek; Tatge, Reid; Torczon, Linda
The Platform-Aware Compilation Environment (PACE) is an ambitious attempt to construct a portable compiler that produces code capable of achieving high levels of performance on new architectures. The key strategies in PACE are the design and development of an optimizer and runtime system that are parameterized by system characteristics, the automatic measurement of those characteristics, the extensive use of measured performance data to help drive optimization, and the use of machine learning to improve the long-term effectiveness of the compiler and runtime system.