|
@@ -0,0 +1,28 @@
|
|
|
+# Gnews Keyword Extraction
|
|
|
+**Extract keywords from target-domain news stored in the database (`gnews.gnews_detail`).**
|
|
|
+
|
|
|
+First, we use a Transformer model to encode each news article as a semantic vector, and cluster the vectors with HDBSCAN.
|
|
|
+Next, we use the sample text in [predict_doc.txt](predict_doc.txt) to predict which cluster the target domain belongs to.
|
|
|
+Finally, we use unsupervised algorithms such as RAKE, TF-IDF, TextRank, and MultipartiteRank to extract keywords from the news in this cluster and store them in [tag_list.csv](tag_list.csv).
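As a rough illustration of the last step, the sketch below ranks the terms of one cluster by TF-IDF using only the standard library. It is not the code in `gnews_keyword_extraction.py`: the function names `tokenize` and `top_k_tfidf`, the crude tokenizer, and the sample headlines are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic tokens of 3+ letters
    # (a crude stand-in for real preprocessing).
    return re.findall(r"[a-z]{3,}", text.lower())


def top_k_tfidf(cluster_docs, background_docs, top_k=5):
    """Rank terms in cluster_docs by TF-IDF; IDF is computed over all documents."""
    all_docs = [tokenize(d) for d in cluster_docs + background_docs]
    n_docs = len(all_docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in all_docs for term in set(doc))
    # Term frequency pooled over the documents of the target cluster.
    term_freq = Counter(t for d in cluster_docs for t in tokenize(d))
    total = sum(term_freq.values())
    scores = {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical sample data: one "target" cluster and some unrelated news.
cluster = [
    "Lithium battery makers expand supply",
    "Lithium battery prices fall",
    "New lithium mine opens",
]
background = [
    "Team wins the final match",
    "Election results expected this week",
    "Storm delays morning flights",
]
print(top_k_tfidf(cluster, background, top_k=2))
```

Terms that are frequent inside the cluster but rare elsewhere score highest, which is the property the extraction step relies on when it fills the tag list.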
|
|
|
+
|
|
|
+
|
|
|
+## Install
|
|
|
+```bash
|
|
|
+pip install -r requirements.txt
|
|
|
+```
|
|
|
+Note: See [requirements.txt](requirements.txt) for more details.
|
|
|
+
|
|
|
+## Usage
|
|
|
+### Step 1
|
|
|
+
|
|
|
+Copy news content from the target domain and paste it into [predict_doc.txt](predict_doc.txt).
|
|
|
+
|
|
|
+### Step 2
|
|
|
+```bash
|
|
|
+python gnews_keyword_extraction.py --topK 80 --target_domain_range 1
|
|
|
+```
|
|
|
+* `--topK` *(optional)*: Number of top keywords to return. (Default: `80`)
|
|
|
+* `--target_domain_range` *(optional)*: Candidate range around the target domain. `0` uses only the predicted target cluster; `1` uses three clusters in total, the target cluster plus its neighbors. (Default: `1`)
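The two options can be sketched with `argparse` as below. The flag names and defaults come from the list above; everything else is an assumption for illustration. In particular, `candidate_clusters` is a hypothetical helper that reads "range" as neighboring cluster indices on each side of the target, which may differ from how the script actually picks neighboring clusters.

```python
import argparse


def build_parser():
    # Flag names and defaults mirror the option list; help texts are illustrative.
    parser = argparse.ArgumentParser(description="Gnews keyword extraction (sketch)")
    parser.add_argument("--topK", type=int, default=80,
                        help="number of top keywords to return")
    parser.add_argument("--target_domain_range", type=int, default=1,
                        help="candidate range around the target cluster")
    return parser


def candidate_clusters(target, domain_range, n_clusters):
    # Hypothetical helper: assumes clusters are indexed 0..n_clusters-1 and that
    # the range selects `domain_range` neighbors on each side of the target.
    low = max(0, target - domain_range)
    high = min(n_clusters - 1, target + domain_range)
    return list(range(low, high + 1))


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--target_domain_range", "0"])
print(args.topK, candidate_clusters(5, args.target_domain_range, 10))
print(candidate_clusters(5, 1, 10))
```

Under this reading, a range of `0` yields one cluster and a range of `1` yields three, matching the counts given in the option description.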
|
|
|
+
|
|
|
+### Step 3
|
|
|
+Read the extracted keywords from [tag_list.csv](tag_list.csv).
|