Qun Chen 陈群
Professor
Tel / Wechat: 13299168988
Email: chenbenben@nwpu.edu.cn
Office: Room 316, School of Computer Science
Address: Northwestern Polytechnical University
Xi'an, Shaanxi, China

Bio

  • Sep 1993 - Jul 1998. Undergraduate, Management Information System, Tsinghua University.
  • Aug 1999 - Oct 2003. Ph.D, Computer Science, National University of Singapore.
  • Nov 2003 - Feb 2007. Postdoc, Hong Kong University of Science and Technology.
  • Mar 2007 - Now. Professor and doctoral supervisor, Northwestern Polytechnical University.

Research Interests

Because deep neural networks (DNNs) are uncertain and hard to interpret, I focus on risk analysis for deep AI models, i.e. analyzing and evaluating the risk that an AI model mislabels a target instance in a classification problem. Risk analysis is by itself an important and interesting research problem. Moreover, it can have a profound impact on the design and implementation of core machine learning operations, e.g. active selection of training instances, model training and model selection.

Even though deep learning has achieved tremendous success, its efficacy usually relies on a large amount of accurately labeled training data. Unfortunately, high-quality labeled data may not be readily available in real AI applications.

I have proposed a new non-i.i.d paradigm of machine learning, namely Gradual machine learning (GML). Given a classification task, GML begins with the easy instances, which can usually be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on evidential certainty by iterative factor inference. Compared with traditional i.i.d learning (e.g. deep learning), GML is more interpretable and requires less or even no manually labeled data.
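The easy-to-hard labeling order described above can be illustrated with a toy loop. This is only a minimal sketch under strong assumptions: the function name `gradual_label`, the distance-weighted vote, and the parameter `k` are hypothetical, and the vote is a simple stand-in for iterative factor inference, not the GML inference procedure itself. It handles binary labels only.

```python
import math

def gradual_label(points, easy_labels, k=3):
    """Toy easy-to-hard labeling loop (binary labels 0/1).

    points: list of coordinate tuples.
    easy_labels: dict {index: label} for the automatically labeled easy instances.
    Returns a dict mapping every index to a label.
    """
    labels = dict(easy_labels)
    unlabeled = set(range(len(points))) - set(labels)
    while unlabeled:
        best = None  # (certainty, index, predicted label)
        for i in sorted(unlabeled):
            # Evidence comes from the k nearest already-labeled instances.
            neighbours = sorted(
                (math.dist(points[i], points[j]), labels[j]) for j in labels
            )[:k]
            weights = [1.0 / (1.0 + d) for d, _ in neighbours]
            p1 = sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)
            certainty = abs(p1 - 0.5)  # distance from a coin flip
            if best is None or certainty > best[0]:
                best = (certainty, i, int(p1 >= 0.5))
        # Label only the single most certain instance, then re-estimate the rest.
        _, i, y = best
        labels[i] = y
        unlabeled.remove(i)
    return labels
```

Each round commits only the instance with the highest certainty, so newly labeled instances become evidence for the harder ones in later rounds.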

Recent Publications

Few-shot Image Classification based on Gradual Machine Learning. Expert Systems with Applications, 2024.
Na Chen, Xianming Kuang, Feiyu Liu, Kehao Wang, Qun Chen.
[Abstract]  [PDF]

Few-shot image classification aims to accurately classify unlabeled images using only a few labeled samples. The state-of-the-art solutions are built by deep learning, which focuses on designing increasingly complex deep backbones. Unfortunately, the task remains very challenging due to the difficulty of transferring the knowledge learned in training classes to new ones. In this paper, we propose a novel approach based on the non-i.i.d paradigm of gradual machine learning (GML). It begins with only a few labeled observations, and then gradually labels target images in the increasing order of hardness by iterative factor inference in a factor graph. Specifically, our proposed solution extracts indicative feature representations by deep backbones, and then constructs both unary and binary factors based on the extracted features to facilitate gradual learning. The unary factors are constructed based on class center distance in an embedding space, while the binary factors are constructed based on k-nearest neighborhood. We have empirically validated the performance of the proposed approach on benchmark datasets by a comparative study. Our extensive experiments demonstrate that the proposed approach can improve the SOTA performance by 1-5% in terms of accuracy. More notably, it is more robust than the existing deep models in that its performance can consistently improve as the size of query set increases while the performance of deep models remains essentially flat or even becomes worse.
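The two kinds of evidence named in the abstract, class-center distance for unary factors and k-nearest neighborhood for binary factors, can be sketched as follows. The helper names (`class_centers`, `unary_evidence`, `knn_pairs`) and the use of plain Euclidean distance are assumptions made for illustration; the paper's actual factor definitions and embedding backbone are not reproduced here.

```python
import math

def class_centers(support):
    """support: dict {class_name: list of embedding vectors}.
    Returns the mean embedding of each class."""
    centers = {}
    for cls, vecs in support.items():
        dim = len(vecs[0])
        centers[cls] = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return centers

def unary_evidence(query, centers):
    """Distance from one query embedding to every class center
    (a smaller distance is stronger evidence for that class)."""
    return {cls: math.dist(query, c) for cls, c in centers.items()}

def knn_pairs(queries, k=1):
    """Index pairs (i, j), i < j, where one query is among the
    k nearest neighbours of the other; candidates for binary factors."""
    pairs = set()
    for i, q in enumerate(queries):
        others = sorted(
            (math.dist(q, r), j) for j, r in enumerate(queries) if j != i
        )
        for _, j in others[:k]:
            pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)
```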

Adaptive deep learning for entity disambiguation via knowledge-based risk analysis. Expert Systems with Applications, 2024.
Youcef Nafa, Qun Chen, Boyi Hou, Zhanhuai Li.
[Abstract]  [PDF]

The state-of-the-art performance on entity disambiguation has been reached by deep neural networks. However, the task remains very challenging due to the complexity of natural language. Moreover, the target data distribution is often different from that of training data. In this paper, we address the limitation of deep entity disambiguation from the perspective of misprediction risk. We propose a knowledge-based approach of risk analysis for entity disambiguation, and leverage it to enable adaptive deep learning. The proposed approach generates risk features by extracting evidence from the knowledge base, and then models them as a linearly-weighted random vector where an attention mechanism is used to focus on the most significant components. Finally, it estimates misprediction risk of the aggregated probability distribution via the Conditional Value-at-Risk metric. Furthermore, we demonstrate how to utilize risk analysis results in adaptive deep learning via two-phase training: the first phase fits on labeled training data, while the second minimizes misprediction risk on unlabeled target data. We evaluate the performance of the proposed approach on benchmark datasets through a comparative study. Our thorough experiments demonstrate that it can detect mispredictions more accurately than existing alternatives and can substantially improve the performance of deep learning models.
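For readers unfamiliar with the Conditional Value-at-Risk metric mentioned above, here is a minimal empirical version computed over a list of sampled values. It is a sketch under assumptions: the function names and the sample-based definition (mean of the values at or above the empirical alpha-quantile) are illustrative, and the paper's risk distribution and attention weighting are not modeled.

```python
import math

def value_at_risk(losses, alpha=0.9):
    """Empirical alpha-quantile: the smallest sampled value with at
    least a fraction alpha of the samples at or below it."""
    ordered = sorted(losses)
    idx = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[idx]

def conditional_value_at_risk(losses, alpha=0.9):
    """Mean of the sampled values at or above the Value-at-Risk,
    i.e. the average outcome in the worst tail."""
    var = value_at_risk(losses, alpha)
    tail = [x for x in losses if x >= var]
    return sum(tail) / len(tail)
```

Because it averages over the tail rather than reading off a single quantile, CVaR is never smaller than VaR at the same level and reacts to how bad the worst cases are.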

Supervised Gradual Machine Learning for Aspect-Term Sentiment Analysis. Transactions of ACL (TACL), 2023.
Yanyan Wang, Qun Chen, Murtadha H.M. Ahmed, Zhaoqiang Chen, Jing Su, Wei Pan, Zhanhuai Li.
[Abstract]  [PDF]  [Homepage]

Recent work has shown that Aspect-Term Sentiment Analysis (ATSA) can be effectively performed by Gradual Machine Learning (GML). However, the performance of the current unsupervised solution is limited by inaccurate and insufficient knowledge conveyance. In this paper, we propose a supervised GML approach for ATSA, which can effectively exploit labeled training data to improve knowledge conveyance. It leverages binary polarity relations between instances, which can be either similar or opposite, to enable supervised knowledge conveyance. Besides the explicit polarity relations indicated by discourse structures, it also separately supervises a polarity classification DNN and a binary siamese network to extract implicit polarity relations. The proposed approach fulfills knowledge conveyance by modeling detected relations as binary features in a factor graph. Our extensive experiments on real benchmark data show that it achieves the state-of-the-art performance across all the test workloads. Our work demonstrates clearly that, in collaboration with DNN for feature extraction, GML outperforms pure DNN solutions.

Adaptive Deep Learning for Entity Resolution by Risk Analysis. Knowledge-Based Systems (KBS), 2023.
Qun Chen, Zhaoqiang Chen, Youcef Nafa, Tianyi Duan, Wei Pan, Lijun Zhang, Zhanhuai Li
[Abstract]  [PDF]


The state-of-the-art performance on entity resolution (ER) has been achieved by deep learning. However, deep models are usually trained on large quantities of accurately labeled training data, and cannot be easily tuned towards a target workload. Unfortunately, in real scenarios, there may not be sufficient labeled training data, and even worse, their distribution is usually more or less different from the target workload even when they come from the same domain.
To alleviate the said limitations, this paper proposes a novel risk-based approach to tune a deep model towards a target workload by its particular characteristics. Built on the recent advances on risk analysis for ER, the proposed approach first trains a deep model on labeled training data, and then fine-tunes it by minimizing its estimated misprediction risk on unlabeled target data. Our theoretical analysis shows that risk-based adaptive training can correct the label status of a mispredicted instance with a fairly good chance. We have also empirically validated the efficacy of the proposed approach on real benchmark data by a comparative study. Our extensive experiments show that it can considerably improve the performance of deep models. Furthermore, in the scenario of distribution misalignment, it can similarly outperform the state-of-the-art alternative of transfer learning by considerable margins. Using ER as a test case, we demonstrate that risk-based adaptive training is a promising approach potentially applicable to various challenging classification tasks.

Adaptive Deep Learning for Network Intrusion Detection by Risk Analysis. Neurocomputing, 2022.
Lijun Zhang, Xingyu Lu, Zhaoqiang Chen, Tianwei Liu, Qun Chen, Zhanhuai Li
[Abstract]  [PDF]  [Code]  [Data]


With increasing connectedness, network intrusion has become a critical security concern for modern information systems. The state-of-the-art performance of Network Intrusion Detection (NID) has been achieved by deep learning. Unfortunately, NID remains very challenging, and in real scenarios, deep models may still mislabel many network activities. Therefore, there is a need for risk analysis, which aims to know which activities may be mislabeled and why.
In this paper, we propose a novel solution of interpretable risk analysis for NID that can rank the activities in a task by their mislabeling risk. Built upon the existing framework of LearnRisk, it first extracts interpretable risk features and then trains a risk model by a learning-to-rank objective. It constructs risk features based on domain knowledge of network intrusion as well as statistical characteristics of activities. Furthermore, we demonstrate how to leverage risk analysis to improve prediction accuracy of deep models. Specifically, we present an adaptive training approach for NID that can effectively fine-tune a deep model towards a particular workload by minimizing its misprediction risk. Finally, we empirically evaluate the performance of the proposed solutions on real benchmark data. Our extensive experiments have shown that the proposed solution of risk analysis can identify mislabeled activities with considerably higher accuracy than the existing alternatives, and the proposed solution of adaptive training can effectively improve the performance of deep models by considerable margins in both offline and online settings.

Active Deep Learning on Entity Resolution by Risk Sampling. Knowledge-Based Systems (KBS), 2021.
Youcef Nafa, Qun Chen, Zhaoqiang Chen, Xingyu Lu, Tianyi Duan and Zhanhuai Li.
[Abstract]  [PDF]

While the state-of-the-art performance on entity resolution (ER) has been achieved by deep learning, its effectiveness depends on large quantities of accurately labeled training data. To alleviate the data labeling burden, Active Learning (AL) presents itself as a feasible solution that focuses on data deemed useful for model training. Building upon the recent advances in risk analysis for ER, which can provide a more refined estimate on label misprediction risk than the simpler classifier outputs, we propose a novel AL approach of risk sampling for ER. Risk sampling leverages misprediction risk estimation for active instance selection. Based on the core-set characterization for AL, we theoretically derive an optimization model which aims to minimize core-set loss with non-uniform Lipschitz continuity. Since the defined weighted K-medoids problem is NP-hard, we then present an efficient heuristic algorithm. Finally, we empirically verify the efficacy of the proposed approach on real data by a comparative study. Our extensive experiments have shown that it outperforms the existing alternatives by considerable margins.
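To make the weighted K-medoids objective concrete, the sketch below uses a generic greedy selection: at each step it adds the point that most reduces the weighted sum of distances from every point to its nearest chosen medoid. This is an assumed, textbook-style heuristic written for illustration, not the heuristic algorithm presented in the paper; the function names and the idea of passing risk scores as `weights` are hypothetical.

```python
import math

def weighted_cost(points, weights, medoids):
    """Weighted sum of distances from each point to its nearest medoid."""
    return sum(
        w * min(math.dist(p, points[m]) for m in medoids)
        for p, w in zip(points, weights)
    )

def greedy_weighted_k_medoids(points, weights, k):
    """Pick k medoid indices one at a time, each time taking the
    candidate that yields the lowest weighted cost so far."""
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(len(points)) if i not in medoids]
        best = min(
            candidates,
            key=lambda i: weighted_cost(points, weights, medoids + [i]),
        )
        medoids.append(best)
    return medoids
```

With uniform weights this reduces to ordinary greedy K-medoids; raising the weight of an instance pulls the selection towards it, which is how a risk score could steer active instance selection.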

DNN-driven Gradual Machine Learning for Aspect-Term Sentiment Analysis. Findings of ACL, 2021.
Murtadha Ahmed, Qun Chen, Yanyan Wang, Youcef Nafa, Zhanhuai Li and Tianyi Duan
[Abstract]  [PDF]

Recent work has shown that Aspect-Term Sentiment Analysis (ATSA) can be performed by Gradual Machine Learning (GML), which begins with some automatically labeled easy instances, and then gradually labels more challenging instances by iterative factor graph inference without manual intervention. As a non-i.i.d learning paradigm, GML leverages shared features between labeled and unlabeled instances for knowledge conveyance. However, the existing GML solution extracts sentiment features based on pre-specified lexicons, which are usually inaccurate and incomplete and thus lead to inadequate knowledge conveyance. In this paper, we propose a Deep Neural Network (DNN) driven GML approach for ATSA, which exploits the power of DNN in feature representation for gradual learning. It first uses an unsupervised neural network to cluster the automatically extracted features by their sentiment orientation. Then, it models the clustered features as factors to enable implicit knowledge conveyance for gradual inference in a factor graph. To leverage labeled training data, we also present a hybrid solution that fulfills gradual learning by fusing the influence of supervised DNN predictions and implicit knowledge conveyance in a unified factor graph. Finally, we empirically evaluate the performance of the proposed approach on real benchmark data. Our extensive experiments have shown that the proposed approach consistently achieves the state-of-the-art performance across all the test datasets in both unsupervised and supervised settings and the improvement margins are considerable.

Attention-enhanced Gradual Machine Learning for Entity Resolution. IEEE Intelligent Systems, 2021.
Ping Zhong, Zhanhuai Li, Qun Chen and Boyi Hou
[Abstract]  [PDF]

Recent work has shown that Entity Resolution (ER) can be effectively performed by Gradual Machine Learning (GML). GML begins with some automatically labeled easy instances, and then gradually labels more challenging instances by iterative factor graph inference without human intervention. In GML, shared features serve as the medium for knowledge conveyance between easy instances and more challenging ones. The existing GML solution supposes that features play independent roles in gradual inference. However, in real scenarios, this assumption may be untenable since features are usually correlated with each other. To address this limitation, this paper proposes an attention-enhanced approach to improve the accuracy of gradual inference. We first propose a method of spectral feature representation to map correlated features to close points in the same vector space, and then present a model of attention neural network to learn the decisive features given arbitrary combinations of features for improved feature weighting. Finally, our extensive experiments on real benchmark data have validated the efficacy of the proposed approach.

Aspect-Level Sentiment Analysis based on Gradual Machine Learning. Knowledge-Based Systems (KBS), 2021.
Yanyan Wang, Qun Chen, Jiquan Shen, Boyi Hou, Murtadha Ahmed, Zhanhuai Li
[Abstract]  [PDF]  [Homepage]

The state-of-the-art solutions for Aspect-Level Sentiment Analysis (ALSA) were built on a variety of Deep Neural Networks (DNN), whose efficacy depends on large quantities of accurately labeled training data. Unfortunately, high-quality labeled training data usually require expensive manual work, thus may not be readily available in real scenarios. In this paper, we propose a novel approach for aspect-level sentiment analysis based on the recently proposed paradigm of Gradual Machine Learning (GML), which can enable accurate machine labeling without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels the more challenging instances by iterative factor graph inference. In the process of gradual machine learning, the hard instances are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on the benchmark datasets have shown that the performance of the proposed solution is considerably better than its unsupervised alternatives, and also highly competitive compared with the state-of-the-art supervised DNN models.

Gradual Machine Learning for Entity Resolution. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2020.
Boyi Hou, Qun Chen, Yanyan Wang, Youcef Nafa, and Zhanhuai Li
[Abstract]  [PDF]

Usually considered as a classification problem, entity resolution (ER) can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this paper, we propose a novel learning paradigm for ER, called gradual machine learning, which aims to enable effective machine labeling without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances by iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data have shown that the performance of the proposed approach is considerably better than its unsupervised alternatives, and highly competitive compared to the state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.

Towards Interpretable and Learnable Risk Analysis for Entity Resolution. International Conference on Management of Data (SIGMOD), 2020.
Zhaoqiang Chen, Qun Chen, Boyi Hou, Tianyi Duan, Zhanhuai Li and Guoliang Li
[Abstract]  [Bibtex]  [PDF]

Machine-learning-based entity resolution has been widely studied. However, some entity pairs may be mislabeled by machine learning models and existing studies do not study the risk analysis problem: predicting and interpreting which entity pairs are mislabeled. In this paper, we propose an interpretable and learnable framework for risk analysis, which aims to rank the labeled pairs based on their risks of being mislabeled. We first describe how to automatically generate interpretable risk features, and then present a learnable risk model and its training technique. Finally, we empirically evaluate the performance of the proposed approach on real data. Our extensive experiments have shown that the learning risk model can identify the mislabeled pairs with considerably higher accuracy than the existing alternatives.

@article{chen2019towards,
title={Towards Interpretable and Learnable Risk Analysis for Entity Resolution},
author={Chen, Zhaoqiang and Chen, Qun and Hou, Boyi and Duan, Tianyi and Li, Zhanhuai and Li, Guoliang},
journal={arXiv preprint arXiv:1912.02947},
year={2019}
}

Gradual Machine Learning for Entity Resolution. WWW 2019.
Boyi Hou, Qun Chen, Jiquan Shen, Xin Liu, Ping Zhong, Yanyan Wang, Zhaoqiang Chen, Zhanhuai Li
[Abstract]  [Bibtex]  [PDF]  [Homepage]

Usually considered as a classification problem, entity resolution can be very challenging on real data due to the prevalence of dirty values. The state-of-the-art solutions for ER were built on a variety of learning models (most notably deep neural networks), which require lots of accurately labeled training data. Unfortunately, high-quality labeled data usually require expensive manual work, and are therefore not readily available in many real scenarios. In this demo, we propose a novel learning paradigm for ER, called gradual machine learning, which aims to enable effective machine labeling without the requirement for manual labeling effort. It begins with some easy instances in a task, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances based on iterative factor graph inference. In gradual machine learning, the hard instances in a task are gradually labeled in small stages based on the estimated evidential certainty provided by the labeled easier instances. Our extensive experiments on real data have shown that the proposed approach performs considerably better than its unsupervised alternatives, and its performance is also highly competitive compared to the state-of-the-art supervised techniques. Using ER as a test case, we demonstrate that gradual machine learning is a promising paradigm potentially applicable to other challenging classification tasks requiring extensive labeling effort.

@inproceedings{hou2019gradual,
title={Gradual machine learning for entity resolution},
author={Hou, Boyi and Chen, Qun and Shen, Jiquan and Liu, Xin and Zhong, Ping and Wang, Yanyan and Chen, Zhaoqiang and Li, Zhanhuai},
booktitle={The World Wide Web Conference},
pages={3526--3530},
year={2019},
organization={ACM}
}

Joint Inference for Aspect-Level Sentiment Analysis by Deep Neural Networks and Linguistic Hints. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2019.
Yanyan Wang, Qun Chen, Murtadha Ahmed, Zhanhuai Li, Wei Pan, and Hailong Liu
[Abstract]  [Bibtex]  [PDF]  [Homepage]

The state-of-the-art techniques for aspect-level sentiment analysis focused on feature modeling using a variety of deep neural networks (DNN). Unfortunately, their performance may still fall short of expectation in real scenarios due to the semantic complexity of natural languages. Motivated by the observation that many linguistic hints (e.g., sentiment words and shift words) are reliable polarity indicators, we propose a joint framework, SenHint, which can seamlessly integrate the output of deep neural networks and the implications of linguistic hints in a unified model based on Markov logic network (MLN). SenHint leverages the linguistic hints for multiple purposes: (1) to identify the easy instances, whose polarities can be automatically determined by the machine with high accuracy; (2) to capture the influence of sentiment words on aspect polarities; (3) to capture the implicit relations between aspect polarities. We present the required techniques for extracting linguistic hints, encoding their implications as well as the output of DNN into the unified model, and joint inference. Finally, we have empirically evaluated the performance of SenHint on both English and Chinese benchmark datasets. Our extensive experiments have shown that compared to the state-of-the-art DNN techniques, SenHint can effectively improve polarity detection accuracy by considerable margins.

@article{wang2019joint,
title={Joint Inference for Aspect-level Sentiment Analysis by Deep Neural Networks and Linguistic Hints},
author={Wang, Yanyan and Chen, Qun and Ahmed, Murtadha and Li, Zhanhuai and Pan, Wei and Liu, Hailong},
journal={IEEE Transactions on Knowledge and Data Engineering},
year={2019},
publisher={IEEE}

}

Improving Machine-based Entity Resolution with Limited Human Effort: A Risk Perspective. International Workshop on Real-Time Business Intelligence and Analytics, 2018.
Zhaoqiang Chen, Qun Chen, Boyi Hou, Murtadha Ahmed, Zhanhuai Li
[Abstract]  [Bibtex]  [PDF]  [Technical report]

Pure machine-based solutions usually struggle in the challenging classification tasks such as entity resolution (ER). To alleviate this problem, a recent trend is to involve the human in the resolution process, most notably the crowdsourcing approach. However, it remains very challenging to effectively improve machine-based entity resolution with limited human effort. In this paper, we investigate the problem of human and machine cooperation for ER from a risk perspective. We propose to select the machine-labeled instances at high risk of being mislabeled for manual verification. For this task, we present a risk model that takes into consideration the human-labeled instances as well as the output of machine resolution. Finally, we evaluate the performance of the proposed risk model on real data. Our experiments demonstrate that it can pick up the mislabeled instances with considerably higher accuracy than the existing alternatives. Provided with the same amount of human cost budget, it can also achieve better resolution quality than the state-of-the-art approach based on active learning.

@inproceedings{chen2018risker,
title={Improving Machine-based Entity Resolution with Limited Human Effort: A Risk Perspective},
author={Chen, Zhaoqiang and Chen, Qun and Hou, Boyi and Ahmed, Murtadha and Li, Zhanhuai},
booktitle={Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics},
series={BIRTE'18},
numpages={5},
year={2018},
doi={10.1145/3242153.3242156},
publisher={ACM},
}

r-HUMO: A Risk-aware Human-Machine Cooperation Framework for Entity Resolution with Quality Guarantees. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2018.
Boyi Hou, Qun Chen, Zhaoqiang Chen, Youcef Nafa, Zhanhuai Li
[Abstract]  [Bibtex]  [PDF]  [Technical report]

Even though many approaches have been proposed for entity resolution (ER), it remains very challenging to enforce quality guarantees. To this end, we propose a risk-aware HUman-Machine cOoperation framework for ER, denoted by r-HUMO. Built on the existing HUMO framework, r-HUMO similarly enforces both precision and recall guarantees by partitioning an ER workload between the human and the machine. However, r-HUMO is the first solution that optimizes the process of human workload selection from a risk perspective. It iteratively selects human workload by real-time risk analysis based on the human-labeled results as well as the pre-specified machine metric. In this paper, we first introduce the r-HUMO framework and then present the risk model to prioritize the instances for manual inspection. Finally, we empirically evaluate r-HUMO's performance on real data. Our extensive experiments show that r-HUMO is effective in enforcing quality guarantees, and compared with the state-of-the-art alternatives, it can achieve desired quality control with reduced human cost.

@article{hou2018rhumo,
title={r-HUMO: A Risk-aware Human-Machine Cooperation Framework for Entity Resolution with Quality Guarantees},
author={Hou, Boyi and Chen, Qun and Chen, Zhaoqiang and Nafa, Youcef and Li, Zhanhuai},
journal={IEEE Transactions on Knowledge and Data Engineering (TKDE)},
year={2018},
doi={10.1109/TKDE.2018.2883532},
publisher={IEEE},
}

SenHint: A Joint Framework for Aspect-level Sentiment Analysis by Deep Neural Networks and Linguistic Hints. WWW 2018.
Yanyan Wang, Qun Chen, Xin Liu, Murtadha Ahmed, Zhanhuai Li, Wei Pan, Hailong Liu
[Abstract]  [Bibtex]  [PDF]  [Homepage]

The state-of-the-art techniques for aspect-level sentiment analysis focus on feature modeling using a variety of deep neural networks (DNN). Unfortunately, their practical performance may fall short of expectations due to semantic complexity of natural languages. Motivated by the observation that linguistic hints (e.g. explicit sentiment words and shift words) can be strong indicators of sentiment, we present a joint framework, SenHint, which integrates the output of deep neural networks and the implication of linguistic hints into a coherent reasoning model based on Markov Logic Network (MLN). In SenHint, linguistic hints are used in two ways: (1) to identify easy instances, whose sentiment can be automatically determined by machine with high accuracy; (2) to capture implicit relations between aspect polarities. We also empirically evaluate the performance of SenHint on both English and Chinese benchmark datasets. Our experimental results show that SenHint can effectively improve accuracy compared with the state-of-the-art alternatives.

@inproceedings{DBLP:conf/www/WangCLALPL18,
author={Wang, Yanyan and Chen, Qun and Liu, Xin and Ahmed, Murtadha and Li, Zhanhuai and Pan, Wei and Liu, Hailong},
title = {SenHint: {A} Joint Framework for Aspect-level Sentiment Analysis by
Deep Neural Networks and Linguistic Hints},
booktitle = {Companion of the The Web Conference 2018 on The Web Conference 2018,
{WWW} 2018, Lyon, France, April 23-27, 2018},
pages = {207--210},
year = {2018},
crossref = {DBLP:conf/www/2018c},
url = {http://doi.acm.org/10.1145/3184558.3186980},
doi = {10.1145/3184558.3186980},
timestamp = {Tue, 24 Apr 2018 14:09:22 +0200},
biburl = {https://dblp.org/rec/bib/conf/www/WangCLALPL18},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework. ICDE 2018.
Zhaoqiang Chen, Qun Chen, Fengfeng Fan, Yanyan Wang, Zhuo Wang, Youcef Nafa, Zhanhuai Li, Hailong Liu, Wei Pan
[Abstract]  [Bibtex]  [PDF]  [Slides]

Even though many machine algorithms have been proposed for entity resolution, it remains very challenging to find a solution with quality guarantees. In this paper, we propose a novel Human and Machine cOoperation (HUMO) framework for entity resolution (ER), which divides an ER workload between the machine and the human. HUMO enables a mechanism for quality control that can flexibly enforce both precision and recall levels. We introduce the optimization problem of HUMO, minimizing human cost given a quality requirement, and then present three optimization approaches: a conservative baseline one purely based on the monotonicity assumption of precision, a more aggressive one based on sampling and a hybrid one that can take advantage of the strengths of both previous approaches. Finally, we demonstrate by extensive experiments on real and synthetic datasets that HUMO can achieve high-quality results with reasonable return on investment (ROI) in terms of human cost, and it performs considerably better than the state-of-the-art alternatives in quality control.

@INPROCEEDINGS{chen2018humo,
author={Z. Chen and Q. Chen and F. Fan and Y. Wang and Z. Wang and Y. Nafa and Z. Li and H. Liu and W. Pan},
booktitle={2018 IEEE 34th International Conference on Data Engineering (ICDE)},
title={Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework},
year={2018},
pages={1156-1167},
doi={10.1109/ICDE.2018.00107},
month={April},
}

A Human-and-Machine Cooperative Framework for Entity Resolution with Quality Guarantees. ICDE 2017.
Zhaoqiang Chen, Qun Chen, Zhanhuai Li
[Abstract]  [Bibtex]  [PDF]  [Homepage]

For entity resolution, it remains very challenging to find the solution with quality guarantees as measured by both precision and recall. In this demo, we propose a HUman-and-Machine cOoperative framework, denoted by HUMO, for entity resolution. Compared with the existing approaches, HUMO enables a flexible mechanism for quality control that can enforce both precision and recall levels. We also introduce the problem of minimizing human cost given a quality requirement and present corresponding optimization techniques. Finally, we demonstrate that HUMO achieves high-quality results with reasonable return on investment (ROI) in terms of human cost on real datasets.

@inproceedings{DBLP:conf/icde/ChenCL17,
author = {Chen, Zhaoqiang and Chen, Qun and Li, Zhanhuai},
title = {A Human-and-Machine Cooperative Framework for Entity Resolution with
Quality Guarantees},
booktitle = {33rd {IEEE} International Conference on Data Engineering, {ICDE} 2017,
San Diego, CA, USA, April 19-22, 2017},
pages = {1405--1406},
year = {2017},
crossref = {DBLP:conf/icde/2017},
url = {https://doi.org/10.1109/ICDE.2017.197},
doi = {10.1109/ICDE.2017.197},
timestamp = {Wed, 24 May 2017 11:31:57 +0200},
biburl = {https://dblp.org/rec/bib/conf/icde/ChenCL17},
bibsource = {dblp computer science bibliography, https://dblp.org}
}