My research deals with developing and analyzing novel, efficient algorithms for learning and inference, and applying these algorithms in challenging real-world domains. My research interests are mainly related to statistical machine learning, and more specifically to the fields of graphical models and deep learning. Perhaps the most distinctive feature of the graphical model approach is how naturally it formulates probabilistic models of complex phenomena while maintaining control over the computational cost associated with these models. Deep learning methods attempt to model high-level abstractions in data by using a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations. As is often pointed out, the same machine learning models and algorithms can be applied in many different research areas. In my research I concentrate on developing and analyzing those algorithms in the context of classical machine learning tasks (classification, clustering, dimensionality reduction, etc.) and applying them to a large variety of real-world applications (computer vision, language processing, audio processing, medical imaging, etc.).

### Selected Projects

- Deep learning methods for speech enhancement
- Deep learning classification with noisy labels
- Deep learning for multi-view mammogram classification
- Context-sensitive distributional models
- Message-passing algorithms for MIMO wireless communication
- Non-parametric differential entropy estimation
- Textual entailment graphs
- Medical image retrieval in large datasets
- Medical image segmentation and lesion detection
- LDPC serial scheduling
- Neighborhood Components Analysis (NCA)
- Mixture of Gaussians, distance and simplification

**Deep learning methods for speech enhancement.** Deep neural networks have recently become a viable methodology for single-microphone speech enhancement. The most common approach is to feed the noisy speech features into a fully connected DNN that either directly enhances the speech signal or infers a mask that can be used for the enhancement. In this case, a single network has to deal with the large variability of the speech signal. We propose a deep mixture-of-experts architecture that addresses this issue. To reduce the large speech variability, we split the network into a gating network and a mixture of networks (denoted experts), each of which specializes in a specific and simpler task. An experimental study shows that the proposed algorithm produces higher objective measurement scores than a single DNN. Joint work with Shlomi Chazan and Sharon Gannot.
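The way a gating network can blend the outputs of several experts is sketched below. This is a minimal illustration that assumes the standard mixture-of-experts form (a softmax-weighted sum of the expert outputs); the function names and the representation of a mask as a list of per-frequency-bin values are hypothetical and are not taken from the published system.

```python
import math

def softmax(scores):
    """Turn raw gate scores into non-negative weights that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(gate_scores, expert_masks):
    """Blend the experts' mask estimates using the gating weights.

    gate_scores:  one raw score per expert (hypothetical gating output)
    expert_masks: one list of per-bin mask values per expert
    """
    weights = softmax(gate_scores)
    n_bins = len(expert_masks[0])
    return [
        sum(w * mask[k] for w, mask in zip(weights, expert_masks))
        for k in range(n_bins)
    ]
```

With equal gate scores every expert contributes equally; a strongly peaked gate effectively selects a single expert for the current frame.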

**Deep learning classification with noisy labels.** The availability of large datasets has enabled neural networks to achieve impressive recognition results. However, the presence of inaccurate class labels is known to deteriorate the performance of even the best classifiers in a broad range of classification problems. Noisy labels also tend to be more harmful than noisy features. When the observed label is noisy, we can view the correct label as a latent random variable. In a line of projects we tried to define neural network architectures that explicitly account for training data with unreliable labels. Joint work with Ehud Ben-Reuven and Alan Bekker.
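Treating the correct label as a latent variable can be illustrated with a small sketch. It assumes, purely for illustration, that the label noise is described by a confusion matrix `noise[i][j] = p(noisy = j | true = i)` that does not depend on the input; the function names are hypothetical.

```python
def noisy_label_distribution(clean_probs, noise):
    """p(noisy = j | x) = sum_i p(noisy = j | true = i) * p(true = i | x)."""
    n_classes = len(clean_probs)
    return [
        sum(clean_probs[i] * noise[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]

def true_label_posterior(clean_probs, noise, observed):
    """Posterior over the latent true label given the observed noisy label."""
    joint = [p * noise[i][observed] for i, p in enumerate(clean_probs)]
    total = sum(joint)
    return [v / total for v in joint]
```

The first function is what a classifier followed by a noise layer would predict for the observed label; the second is the kind of quantity an EM-style procedure would use as a soft target for the latent correct label.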

*ICLR*, 2017. code

*Odyssey, 2016.*

Training deep neural-networks based on unreliable labels. ICASSP, 2016.
Combining soft decisions of several unreliable experts. ICASSP, 2016.

Multi-view probabilistic classification of breast microcalcifications. IEEE TMI, 2016.
A multi-view deep learning architecture for classification of breast microcalcifications. ISBI, 2016.

Context2vec: learning generic context embedding with bidirectional LSTM. CoNLL, 2016.
Modeling word meaning in context with substitute vectors. NAACL, 2015.
Probabilistic modeling of joint-context in distributional similarity. CoNLL, 2014. Best paper runner up.
A two level model for context sensitive inference rules. ACL, 2013. Best paper runner up.

**Deep learning for multi-view mammogram classification.** In this project we address the problem of differentiating between malignant and benign tumors based on their appearance in the CC and MLO mammography views. Classification of clustered breast microcalcifications into benign and malignant categories is an extremely challenging task for computerized algorithms and expert radiologists alike. We applied a deep-learning classification method that is based on two view-level decisions, implemented by two neural networks, followed by a single-neuron layer that combines the view level decisions into a global decision that mimics the biopsy results. Joint work with Alan Bekker and Hayit Greenspan.
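The final combination step can be sketched as a single logistic neuron applied to the two view-level outputs. This is only an illustration: the weights and bias below are hypothetical placeholders that would be learned from biopsy-labeled data, and the view-level networks themselves are not shown.

```python
import math

def combine_views(p_cc, p_mlo, w_cc, w_mlo, bias):
    """Single neuron that fuses the CC and MLO view-level decisions.

    p_cc, p_mlo: malignancy scores produced by the two view-level networks
    w_cc, w_mlo, bias: hypothetical learned parameters of the fusion neuron
    """
    activation = w_cc * p_cc + w_mlo * p_mlo + bias
    return 1.0 / (1.0 + math.exp(-activation))  # logistic sigmoid
```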

**Context-sensitive distributional models.** Context representations are central to various NLP tasks, such as word sense disambiguation, named entity recognition, coreference resolution, and many more. In this research we present models for learning a generic context representation function from large corpora. One approach is based on a bidirectional LSTM. Another approach is based on substitute vectors. Joint work with Oren Melamud and Ido Dagan.

**Message-Passing Algorithms for MIMO Wireless Communication.** The detection problem for MIMO communication systems is known to be NP-hard. The factor graph that corresponds to this problem is very loopy; in fact, it is a complete graph. Hence, a straightforward application of the Belief Propagation (BP) algorithm yields very poor results. We have developed several methods that either modify the cost function we want to optimize or modify the BP messages in such a way that the BP algorithm yields improved performance. One approach is based on an optimal tree approximation of the Gaussian density of the unconstrained linear system. The finite-set constraint is then applied to obtain a cycle-free discrete distribution. Another approach is based on two-dimensional Gaussian projections and a third method is based on imposing priors on the BP messages. Joint work with Amir Leshem.

Improved MIMO detection based on successive tree approximations. ISIT, 2013. C code

Iterative tomographic solution of integer least squares problems with applications to MIMO detection. IEEE Journal of Selected Topics in Signal Processing, 2011.

MIMO detection for high-order QAM based on a Gaussian tree approximation. IEEE Trans. Information Theory, 2011.

Pseudo prior belief propagation for densely connected discrete graphs. IEEE Information Theory Workshop (ITW), 2010.

A Gaussian tree approximation for integer least-squares. NIPS, 2009.

**Non-parametric Differential Entropy Estimation.** Estimating the differential entropy given only sampled points, without any prior knowledge of the distribution, is a difficult task. We proposed the Mean-NN estimator for the main information-theoretic measures such as differential entropy, mutual information and KL-divergence. The Mean-NN estimator is related to classical k-NN entropy estimators. However, unlike the k-NN based estimator, the Mean-NN entropy estimator is a smooth function of the given data points and is not sensitive to small perturbations in the values of the data. Hence, it can be used within optimization procedures that are based on computing the derivatives of the cost function we optimize. We demonstrated the usefulness of the Mean-NN entropy estimation technique on the ICA problem, on clustering and on supervised and unsupervised dimensionality reduction. Joint work with Lev Faivishevsky.
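As a rough illustration of why such an estimator is smooth in the data, the sketch below computes the average log pairwise distance, scaled by the dimension, which is the data-dependent part of the Mean-NN entropy estimate as I understand it. The additive constant (which depends only on the sample size and dimension) is omitted, so this is a sketch under that assumption and is only suitable for comparing or optimizing over point configurations.

```python
import math

def mean_nn_entropy(points):
    """Mean-NN style entropy estimate, up to an additive constant.

    points: list of equal-length coordinate tuples, all distinct
            (coincident points would require log(0)).
    Returns d / (n (n - 1)) * sum over i != j of log ||x_i - x_j||.
    """
    n = len(points)
    d = len(points[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += math.log(math.dist(points[i], points[j]))
    return d * total / (n * (n - 1))
```

Every term is a differentiable function of the point coordinates, in contrast to a k-NN estimate, where the identity of the k-th neighbor can change abruptly under small perturbations. Scaling all points by a factor c shifts the estimate by d·log c, as expected for differential entropy.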

Dimensionality reduction based on non-parametric mutual information. Neurocomputing 2012.

Unsupervised Feature Selection based on Non-Parametric Mutual Information. MLSP, 2012.

A nonparametric information theoretic clustering algorithm. ICML, 2010.

ICA based on a smooth estimation of the differential entropy. NIPS 2008.

**Textual Entailment Graphs.** The objective of Textual Entailment is to automatically recognize whether a target textual hypothesis can be inferred from an input text. We can view this problem as that of learning a global semantic graph, where each node is a predicate and the graph describes all entailment rules between them. This allows us to search for semantic graphs that satisfy certain structural properties. For example, if 'cause an increase' entails 'raise' and 'raise' entails 'affect', then 'cause an increase' must entail 'affect'. This is known as the property of transitivity. We have developed algorithms that exploit this property to obtain improved semantic rules. We have also utilized other structural properties of semantic graphs, such as sparseness (i.e., most predicates do not entail one another) and hierarchy (i.e., entailment relations usually form a hierarchy of predicates), to improve rule learning. The general learning problem can be formulated as an Integer Linear Program, which is NP-hard. A crucial challenge in utilizing structural information is to develop efficient algorithms that run in reasonable time even on very large graphs. Joint work with Jonathan Berant and Ido Dagan.
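The transitivity property from the example above can be checked mechanically. The following sketch lists the violations in a set of directed entailment edges; it is only an illustrative checker with hypothetical names, not the learning algorithm itself.

```python
def transitivity_violations(edges):
    """Return triples (a, b, c) where a -> b and b -> c hold but a -> c is missing.

    edges: iterable of (premise, conclusion) predicate pairs.
    """
    edge_set = set(edges)
    violations = []
    for a, b in edge_set:
        for b2, c in edge_set:
            if b == b2 and a != c and (a, c) not in edge_set:
                violations.append((a, b, c))
    return sorted(violations)

rules = [("cause an increase", "raise"), ("raise", "affect")]
missing = transitivity_violations(rules)
```

Here `missing` contains the single triple whose implied rule, 'cause an increase' entails 'affect', is absent from the edge set. A global learner would either add that edge or drop one of the two rules that imply it.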

Efficient tree-based approximation for entailment graph learning. ACL 2012.

Learning entailment relations by global graph structure optimization. Computational Linguistics 2012.

Global learning of typed entailment rules. ACL 2011.

Global learning of focused entailment graphs. ACL 2010.

**Medical Image Retrieval in Large Datasets.** Retrieving similar cases from a large archive is a very challenging task and is one of the key issues in the rapidly expanding domain of content-based medical image retrieval. We developed an efficient image categorization and retrieval system applied to medical image databases, in particular large radiograph archives. The methodology is based on local patch representation of the image content, using the visual words approach. We explored the effects of various parameters on system performance, and obtained the best results using dense sampling of simple features with spatial content and a non-linear kernel-based Support Vector Machine (SVM) classifier. Our system was ranked first in the ImageCLEF 2009 medical annotation task. In a later work we developed a localized image classification system. The system discriminates between healthy and pathological cases and indicates the subregion in the image that is automatically found to be most relevant for the decision. Joint work with Uri Avni and Hayit Greenspan.

X-ray categorization and spatial localization of chest pathologies. MICCAI, 2011.

X-ray categorization and retrieval on the organ and pathology level. IEEE Trans. Medical Imaging, 2011.

Addressing the ImageClef 2009 Challenge using a patch-based visual words. CLEF, 2009.

**Medical Image Segmentation and Lesion Detection.** Automatic segmentation of brain MRI images into the three main tissue types is a topic of great importance and much research. A related task is detection and segmentation of Multiple Sclerosis (MS) lesions in brain images. To capture the complex tissue spatial layout, we used a probabilistic model that is based on a mixture of multiple spatially oriented Gaussians per tissue. Another medical imaging challenge is lesion detection and segmentation in uterine cervix images. The image is modeled using an MRF in which watershed regions correspond to binary random variables indicating whether the region is part of the lesion tissue or not. The local pairwise factors on the arcs of the watershed map indicate whether the arc is part of the object boundary. The factors are based on supervised learning of a visual word distribution. The final lesion region segmentation is obtained using loopy belief propagation applied to the watershed arc-level MRF. Joint work with Amit Ruf, Oren Freifeld, Amir Alush and Hayit Greenspan.

Automated and interactive lesion detection and segmentation in uterine cervix Images. IEEE Trans. on Medical Imaging, 2010.

Multiple sclerosis lesion detection using constrained GMM and curve evolution. Int. Journal of Biomedical Imaging, 2009.

Constrained Gaussian mixture model framework for automatic segmentation of MR brain images. IEEE Trans. on Medical Imaging, 2006.

**LDPC Serial Scheduling.** LDPC codes are decoded by running an iterative belief-propagation algorithm over the factor graph of the code. In the traditional message-passing schedule, in each iteration all the variable nodes, and subsequently all the factor nodes, pass new messages to their neighbors. We showed that serial scheduling, in which messages are generated using the latest available information, significantly improves the convergence speed in terms of number of iterations. It was observed experimentally in several studies that the serial schedule converges in exactly half the number of iterations compared to the standard parallel schedule. We provided a theoretical motivation for this observation by proving it for single-path graphs. Joint work with Haggai Kfir, Eran Sharon and Simon Litsyn.

Serial schedules for belief-propagation: analysis of convergence time. IEEE Trans. on Information Theory, 2008.

Efficient serial message-passing schedules for LDPC decoding. IEEE Trans. on Information Theory, 2007.

An efficient message-passing schedule for LDPC decoding. Proc. Electrical and Electronic Engineers in Israel, 2004.

**Neighbourhood Components Analysis (NCA)** is a method for learning a Mahalanobis distance measure for k-nearest neighbours (kNN). In particular, it finds a distance metric that maximises the leave-one-out classification accuracy on the training set for a stochastic variant of kNN. NCA can also learn a low-dimensional linear embedding of labelled data for data visualisation and for improved kNN classification speed. Unlike other methods, this classification model is non-parametric, without any assumption on the shape of the class distributions or the boundaries between them. Joint work with Sam Roweis, Geoffrey Hinton and Ruslan Salakhutdinov.
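The quantity NCA optimises can be written down compactly. The sketch below evaluates the expected number of training points that a stochastic nearest-neighbour rule classifies correctly under a linear map `A`, where each point picks a neighbour with probability proportional to exp(-squared distance) in the projected space. It only evaluates the objective; the gradient-based optimisation over `A` is not shown, and the function name is hypothetical.

```python
import math

def nca_objective(A, X, labels):
    """Expected leave-one-out accuracy count of stochastic kNN under map A.

    A:      projection matrix as a list of rows
    X:      list of input vectors
    labels: class label for each input vector
    """
    # Project every point: z = A x
    Z = [[sum(a * x for a, x in zip(row, xi)) for row in A] for xi in X]
    total = 0.0
    for i, zi in enumerate(Z):
        # Unnormalised neighbour-selection weights; a point never picks itself
        weights = [
            0.0 if j == i
            else math.exp(-sum((u - v) ** 2 for u, v in zip(zi, zj)))
            for j, zj in enumerate(Z)
        ]
        norm = sum(weights)  # may underflow to zero for very distant points
        same_class = sum(w for w, lab in zip(weights, labels) if lab == labels[i])
        total += same_class / norm
    return total
```

A map that keeps same-class points close and different-class points far apart scores near the number of training points; a map that collapses the discriminative direction scores much lower.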

Neighbourhood Components Analysis. NIPS 2004. matlab code C code

**Mixture of Gaussians, Distance and Simplification.** The Mixture of Gaussians (MoG) is one of the simplest graphical models. It is a flexible and powerful parametric framework for unsupervised data grouping. We suggested several approaches for computing a meaningful distance between two MoGs and for MoG model simplification. The Unscented Transform is traditionally used in generalized Kalman filters for non-linear dynamical systems. We proposed the usage of the Unscented Transform for computing the KL divergence between two MoGs and for model simplification.
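The idea of approximating the KL divergence between two MoGs with the Unscented Transform can be sketched in one dimension. The expectation under each component of the first mixture is replaced by an average over a few deterministic sigma points; the choice of the two points mean ± standard deviation per component is an assumption made here for the 1-D case, and the tuple layout `(weight, mean, std)` is hypothetical.

```python
import math

def gmm_pdf(x, gmm):
    """Density of a 1-D Gaussian mixture given as (weight, mean, std) tuples."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        for w, mu, sd in gmm
    )

def unscented_kl(f, g):
    """Sigma-point approximation of KL(f || g) for 1-D Gaussian mixtures."""
    total = 0.0
    for w, mu, sd in f:
        # Two sigma points per component, each carrying half of its weight
        for x in (mu - sd, mu + sd):
            total += 0.5 * w * (math.log(gmm_pdf(x, f)) - math.log(gmm_pdf(x, g)))
    return total
```

For two single Gaussians with equal variance, such as N(0, 1) and N(1, 1), the log-density ratio is linear in x, so this approximation reproduces the exact KL divergence of 0.5. For general mixtures it is only an approximation, and in particular it is not guaranteed to be non-negative.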

Simplifying mixture models using the unscented transform. IEEE PAMI, 2008.

A distance measure between GMMs based on the unscented transform and its application to speaker recognition. Eurospeech, 2005.

Hierarchical clustering of a mixture model. NIPS, 2004.

An efficient similarity measure based on approximations of KL-divergence between two Gaussian mixtures. ICCV, 2003.