Constrained Random theory for Rapid Identification of Epidemic-Related Websites in Covid-19 Media Reports

Correspondence to Author: Guiyun Zuang, Zinming Hao,

School of Science, Hubei University of Technology, Wuhan, Hubei, China.


Abstract: Following the COVID-19 outbreak in Wuhan, China, in early December 2019, the Chinese government established a system for information disclosure. More than 400 cities have released precise location details, including residential areas and places of stay, for newly diagnosed cases of novel coronavirus pneumonia. Based on the elements of Chinese geographical names, we establish a rule-dependent model and a conditional random field model. Using Guangdong province as an example, we carry out named entity recognition and automatic extraction of sites related to the epidemic. This approach can help track the epidemic’s spread, support its containment and management, and buy more time for vaccine clinical trials.
The rule-dependent model is built on the combination rules of place-word elements together with a place name dictionary of provinces, cities, and administrative regions. The conditional random field model is built on the way web page text presents the habitual residence or place of stay of diagnosed cases.
Keywords:
COVID-19; Place Name Recognition; Web Crawler; Part-of-Speech Tagging; Conditional Random Field; Elements of Chinese Place Name.

Introduction: The COVID-19 virus first surfaced in Wuhan, Hubei, China, in early December 2019, and within a few months it spread throughout the world [1]. China experienced a crisis in the following months, but official government data indicate that China has essentially stopped local transmission. Some nations are now at the stage China was in early on, while conditions in others are currently worse than they were at the height of the epidemic in China. We believe that one reason China has been able to contain the epidemic is that both the national and local governments maintain a high level of information transparency, and the relevant agencies promptly disseminate the most recent information on the outbreak. According to epidemiological research, examining data such as the residential area or activity location of officially confirmed COVID-19 cases can greatly aid in finding possible patients. Virus experts can use these data to build models of epidemic transmission that assess and forecast the infection source, transmission speed, transmission path, and propagation risk. Communities can also use these data to implement targeted prevention and control, which gives people the right to know and enables better personal protection. In the epidemic reports on web pages, the residential area or activity location of a diagnosed case is typically expressed in a variety of ways, including the page body, embedded text, and screenshots. The distribution of epidemic-related sites must first be analysed from these information sources in a timely manner.
In the past, this task was primarily completed by hand, by searching for and categorising the relevant information in the text. The workload is heavy, efficiency is low, and the results are not timely. As named entity recognition technology has advanced over the past few years, the focus of this work has gradually shifted from manual to automatic extraction, which not only costs less money and labour but also processes tasks more quickly. Named entity recognition is the process of locating named entities in text and classifying them into entity types [2]. General entity types include the names of people, places, and organisations, as well as dates. Our primary goal is to recognise place names in Chinese. Chinese characters are written contiguously, words often span several characters, and there are no spaces between words, all of which make named entities more challenging to identify. To increase recognition accuracy, named entity recognition technology has evolved from the original rule and dictionary techniques, through conventional statistical learning methods, to present-day deep learning methods [3–7]. Current technology has reached good accuracy for several common entity types. The goal of this article is to process the text data on web pages in order to present the entities and their relationships.

The Model: Using named entity recognition technology, we identify and extract place words from the collected unstructured text data. We then classify the identified place words according to a set of rules, dividing them into provinces, cities, administrative regions, and specific locations. Lastly, we perform statistical analysis on the location data in order to provide precise data for epidemic development models built by researchers in the future, and to help evaluate and forecast the source of infection, the rate of transmission, and the route of transmission.
Chinese place name recognition can be studied using three main approaches: rule-based, statistics-based, and deep learning-based. The rule-based approach is natural and intuitive, making it simple for people to comprehend and use. Rule writing, however, requires domain- and language-specific knowledge; portability is poor, covering all patterns is challenging, and the rules tend to become complex [10,11]. Statistics-based approaches are highly portable and do not require extensive language or domain knowledge, but they do require manual corpus annotation and the selection of suitable statistical learning models and parameters [12–14]. Deep learning-based techniques can automatically extract features from the input to build an end-to-end model, without the need for unduly complicated feature engineering [13]. This paper relies mainly on the first two approaches because the amount of text data that could be gathered was limited by both time and data availability. Named entity recognition has recently shown promise for a few restricted entity types. For instance, recognition of the names of people, places, and organisations in news corpora is notably effective. Chinese place name recognition can be treated as a sequence labelling problem: the text is a sequence of words, and place name recognition consists of identifying the correct names within these word sequences.
A place name is made up of several words arranged in a specific order. The conditional random field model combines characteristics of the hidden Markov model and the maximum entropy model, and is well suited to the segmentation and labelling of sequence data.
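To illustrate the sequence labelling view, the following is a minimal sketch of Viterbi decoding over a linear-chain model, the decoding step a conditional random field typically uses. It is not the paper's implementation: the B-LOC/I-LOC/O tag set, the function names `viterbi` and `extract_places`, and all scores are assumptions set by hand for the example, whereas a trained model would derive them from learned feature weights.

```python
from typing import Dict, List, Tuple

# Assumed tag set: B-LOC begins a place name, I-LOC continues it, O is outside.
TAGS = ["B-LOC", "I-LOC", "O"]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring tag sequence for a linear-chain model.

    emissions[t][tag] scores `tag` at position t; transitions[(prev, cur)]
    scores moving from `prev` to `cur`. Missing entries count as 0.0.
    """
    if not emissions:
        return []
    # best[t][tag]: best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0].get(tag, 0.0) for tag in TAGS}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in TAGS:
            prev = max(TAGS,
                       key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = (best[-1][prev] + transitions.get((prev, cur), 0.0)
                           + emissions[t].get(cur, 0.0))
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def extract_places(tokens: List[str], tags: List[str]) -> List[str]:
    """Join each B-LOC token with the I-LOC tokens that follow it."""
    places: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "I-LOC" and current:
            current.append(token)
            continue
        if current:
            places.append("".join(current))
        current = [token] if tag == "B-LOC" else []
    if current:
        places.append("".join(current))
    return places


# Hand-set illustrative scores (hypothetical, not learned from any corpus).
tokens = ["患者", "住", "广州", "市", "天河", "区"]
emissions = [
    {"O": 2.0},
    {"O": 2.0},
    {"B-LOC": 2.0, "O": 0.5},
    {"I-LOC": 1.5, "O": 1.0},
    {"B-LOC": 1.0, "I-LOC": 1.2},
    {"I-LOC": 1.5, "O": 1.0},
]
transitions = {
    ("O", "I-LOC"): -10.0,      # I-LOC may not follow O
    ("B-LOC", "I-LOC"): 1.0,
    ("I-LOC", "I-LOC"): 0.5,
}

tags = viterbi(emissions, transitions)
print(tags)
print(extract_places(tokens, tags))
```

With these scores the decoder labels the last four tokens as one place name, so the city and district are returned as a single string. The large negative score on the O to I-LOC transition shows how transition scores can rule out invalid tag sequences.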

Entity Relationships Extraction Method: Approaches to relational semantics recognition continue to evolve and fall into two categories: rule matching methods and machine learning methods. In the rule template matching method, rule templates are defined in advance and compared against each statement during relationship identification. If a statement satisfies the characteristics of a template, the entities in the statement are taken to have the relationship specified by that template’s attributes [18]. The drawback is poor portability: writing a large number of feature templates takes a long time and requires experienced linguists [19]. The machine learning method uses feature models from pattern recognition and computes entity relationship features and their weights in sentences with the associated algorithms. There are currently two widely used categories of machine learning techniques for handling entity relationships: kernel-based techniques and feature vector-based techniques [20, 21]. Our research aims to carry out location extraction. The feature template for geographic location relationships is comparatively fixed and highly portable. We therefore employ the rule-based matching method to extract relationships among the identified place words. Rule-making and corpus pre-processing have three facets.
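A rule-based classification of recognised place words could look roughly like the sketch below. The dictionary entries, the suffix rules, the category labels, and the function names `classify_place` and `build_record` are all assumptions chosen for illustration. The paper's actual rule templates are not reproduced here.

```python
from typing import Dict, List

# Hypothetical mini-dictionary; a real one would list all provinces and cities.
PROVINCES = {"广东省", "辽宁省"}
CITIES = {"广州市", "深圳市", "中山市", "河源市"}


def classify_place(word: str) -> str:
    """Assign a place word to one of four assumed categories."""
    if word in PROVINCES:
        return "province"
    if word in CITIES:
        return "city"
    # Residential communities end in 区 but are not administrative districts.
    if word.endswith(("小区", "社区")):
        return "specific location"
    if word.endswith(("区", "县")):
        return "administrative region"
    return "specific location"


def build_record(place_words: List[str]) -> Dict[str, str]:
    """Build one structured record from the place words of a sentence.

    Only the first word seen at each level is kept, which is a
    simplification for sentences that mention a single site.
    """
    record: Dict[str, str] = {}
    for word in place_words:
        record.setdefault(classify_place(word), word)
    return record


record = build_record(["广东省", "广州市", "天河区", "体育西路"])
print(record)
```

Checking the dictionary before the suffix rules keeps words such as residential communities from being mistaken for districts, but a lookup alone cannot resolve a name that is both a city and a district in different provinces.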

Results: This paper uses the People’s Daily corpus annotated for January 1998, of which 80% is used as the training set and 20% as the closed test set; the open test set consists of news releases about the COVID-19 outbreak crawled from the Internet. Table 6 displays the entity recognition results. The experimental data show that place name recognition achieves relatively high accuracy: the F values on the open and closed test sets are 0.771 and 0.870, respectively. The entity recognition results show that incorrectly recognised place entities fall mainly into the following categories: a. The text contains abbreviations for cities and provinces, and place names in ambiguous forms are difficult to recognise correctly. For instance, “Zhongshan” can refer either to a city in Guangdong province or to an administrative district in Liaoning province; b. Some place names are used in more than one city. For instance, roads in several cities are named “Baojian Road”, and when several cities are mentioned in a sentence it can be difficult to determine which city the road belongs to; c. Words have distinct meanings depending on context. For instance, “Bajiao Tower” may be the name of a town or of a building.
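For reference, an F value of this kind is commonly computed as the harmonic mean of precision and recall over the recognised entities. The source does not state which F variant was used, so the sketch below assumes the balanced F1 measure. The function name `precision_recall_f` and the example entity sets are hypothetical and are not the paper's data.

```python
from typing import Set, Tuple

Entity = Tuple[int, str]  # (sentence index, place name); an assumed encoding


def precision_recall_f(predicted: Set[Entity],
                       gold: Set[Entity]) -> Tuple[float, float, float]:
    """Compute precision, recall, and balanced F1 over exact entity matches."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: one of three predictions is wrong, one gold entity is missed.
gold = {(0, "广州市"), (1, "天河区"), (2, "中山市"), (3, "体育西路")}
predicted = {(0, "广州市"), (1, "天河区"), (2, "中山区")}

p, r, f = precision_recall_f(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Because matches are exact, a prediction that gets the name right but the level wrong (city versus district) counts as both a false positive and a missed entity, which is how the ambiguity errors listed above lower the score.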

Discussion: The evolution of COVID-19 in a particular region can be examined from a variety of perspectives. This article primarily examines the degree of epidemic spread. We think that the number of epidemic outbreak locations in a region can be used to quantify the spread of the epidemic. Research by other academics has shown that cities with higher levels of economic development and larger migrant populations experience more imported cases than other cities [22]. Higher levels of economic development are found primarily in the Pearl River Delta’s core urban agglomeration, which is located in the south-central part of Guangdong province. Guangzhou is one of these cities, and the epidemic has accordingly spread more widely there. Heyuan City has few imported cases because it lies in a rural, mountainous area with fewer transport links. If the epidemic is thought to be spreading swiftly, action should be taken while the outbreak is still in its early stages. Owing to COVID-19’s long incubation period, the infection may already have spread before a case’s symptoms appear [23]. Automatic extraction is faster than the traditional method of gathering data, so it can be used to identify the areas that require attention and to help trace the source of the disease.
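Counting outbreak locations per region, as proposed above, can be sketched as follows. The record layout (a `city` key and a `specific location` key), the function name `sites_per_city`, and the placeholder records are assumptions for illustration only and do not reflect the paper's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def sites_per_city(records: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count the distinct epidemic-related sites reported for each city."""
    sites: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        city = record.get("city")
        site = record.get("specific location")
        if city and site:
            sites[city].add(site)  # a set ignores repeated reports of one site
    return {city: len(names) for city, names in sites.items()}


# Placeholder records (hypothetical); a repeated site is counted once.
records = [
    {"city": "广州市", "specific location": "site A"},
    {"city": "广州市", "specific location": "site B"},
    {"city": "广州市", "specific location": "site A"},
    {"city": "河源市", "specific location": "site C"},
    {"city": "河源市"},  # no site extracted, so the record is skipped
]

counts = sites_per_city(records)
print(counts)
```

Counting distinct sites rather than raw mentions avoids inflating a city's figure when several reports describe the same location.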

Conclusion: Named entity recognition is a fundamental task in natural language processing, a field of text processing with many applications. To accomplish place name recognition, extraction, and classification, this paper proposes a named entity recognition method based on the conditional random field model and a relationship extraction method based on rule matching.
In summary, this article completed the following tasks: a. First, we downloaded 366 epidemic websites using web crawler technology in order to collect unstructured data. Because websites on the Internet have different organisational structures, a fixed search mode cannot crawl data efficiently, and how to incorporate crawling rules to enhance crawler performance remains an open question; b. Next, we applied the trained conditional random field model to the epidemic text. Conditional random fields are an effective machine learning technique that has shown promise in entity recognition. To provide a sound theoretical basis for future research, this article starts from the theory and elaborates on the model derivation, training algorithm, and labelling techniques of the conditional random field model; c. Finally, we extracted place words using rule-based techniques, categorised them into four groups, and obtained structured data on epidemic sites. In further work, we must incorporate additional features into the relationship extraction rules to increase classification accuracy. In the future, relationship extraction and named entity identification can be combined.


Guiyun Zuang, Zinming Hao. Constrained Random theory for Rapid Identification of Epidemic-Related Websites in Covid-19 Media Reports. The Journal of Clinical Microbiology 2024.

Journal Info

  • Journal Name: The Journal of Clinical Microbiology
  • Impact Factor: 1.803*
  • ISSN: 2995-8539
  • DOI: 10.52338/Tjocmb
  • Short Name: TJOCMB
  • Volume: 6 (2024)