{ "cells": [ { "cell_type": "markdown", "id": "795dcb47", "metadata": {}, "source": [ "# Load data from database" ] }, { "cell_type": "code", "execution_count": 1, "id": "09956185", "metadata": {}, "outputs": [], "source": [ "import time\n", "import pymysql\n", "pymysql.install_as_MySQLdb()\n", "# import MySQLdb\n", "import dataset\n", "import pandas as pd\n", "import pickle\n", "from collections import Counter\n", "import numpy as np\n", "from random import sample\n", "\n", "start_time = time.time()" ] }, { "cell_type": "code", "execution_count": 2, "id": "2b0900b3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>id</th><th>news_url</th><th>news_content</th><th>news_day</th><th>crawler_date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2</td><td>https://www.worldjournal.com/wj/story/121362/5...</td><td>南加地產火熱,但建材漲,人工貴,房價順勢上揚。(記者李雪/攝影)\\n\\n新冠疫情 掀起一場風...</td><td>None</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>1</th><td>3</td><td>https://www.chinatimes.com/newspapers/20210505...</td><td>水泥為所有建設、工程主要原料,又是大宗民生物資之一,在水泥價格預期調漲下,市場指出,水泥漲價...</td><td>2021-05-05 04:10:00+</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>2</th><td>4</td><td>https://www.singtao.ca/4950752/2021-05-19/news...</td><td>舉報\\n\\n疫情爆發以來,本材價格急升超過三倍。 CBC\\n\\n【星島綜合報道】北美建材價格...</td><td>2021-05-19 00:00:00</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>3</th><td>5</td><td>https://tw.appledaily.com/property/20210430/LE...</td><td>房地產業做為台灣經濟的火車頭,從土地、建物、建材、家具乃至設計裝潢產業等,市場皆不容小覷。2...</td><td>2021-04-30 00:00:00</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>4</th><td>6</td><td>https://news.sina.com.tw/article/20210522/3864...</td><td>同意 AGREE\\n\\n如果您繼續閱讀,視同您同意我們隱私條款。This website u...</td><td>2021-05-22 10:02:11+</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>4281</th><td>4290</td><td>https://www.businesstoday.com.tw/article/categ...</td><td>黃之揚先前做過許多工作,卻都難以持久,怎知一碰見海洋垃圾,就成了他終身志業的所在。\\n\\n來...</td><td>2019-11-08 11:38:20+</td><td>2021年06月13日</td></tr>\n",
"    <tr><th>4282</th><td>4291</td><td>https://house.udn.com/house/story/11134/4221093</td><td>獎!獎!獎!清景麟「白易居THE ARCH」 獲三項國際設計獎\\n\\n撰文.攝影/張世雅\\n...</td><td>None</td><td>2021年06月13日</td></tr>\n",
"    <tr><th>4283</th><td>4292</td><td>https://www.ettoday.net/news/20190928/1543997.htm</td><td>文/時尚家居 空間設計暨圖片提供/東之光室內裝修設計\\n\\n\\n\\n《2019艾特獎 最佳辦...</td><td>2019-09-28 15:00:00</td><td>2021年06月13日</td></tr>\n",
"    <tr><th>4284</th><td>4293</td><td>https://www.epochtimes.com/b5/19/6/13/n1132041...</td><td>室內裝修糾紛多?專家指點5步驟 避開陷阱不吃虧\\n\\n【大紀元2019年06月13日訊】(大...</td><td>2019-06-17 19:59:11+</td><td>2021年06月13日</td></tr>\n",
"    <tr><th>4285</th><td>4294</td><td>https://www.epochtimes.com/gb/19/6/13/n1132041...</td><td>室内装修纠纷多?专家指点5步骤 避开陷阱不吃亏\\n\\n【大纪元2019年06月13日讯】(大...</td><td>2019-06-17 19:59:11+</td><td>2021年06月13日</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>4286 rows × 5 columns</p>\n",
"</div>
" ], "text/plain": [ " id news_url \\\n", "0 2 https://www.worldjournal.com/wj/story/121362/5... \n", "1 3 https://www.chinatimes.com/newspapers/20210505... \n", "2 4 https://www.singtao.ca/4950752/2021-05-19/news... \n", "3 5 https://tw.appledaily.com/property/20210430/LE... \n", "4 6 https://news.sina.com.tw/article/20210522/3864... \n", "... ... ... \n", "4281 4290 https://www.businesstoday.com.tw/article/categ... \n", "4282 4291 https://house.udn.com/house/story/11134/4221093 \n", "4283 4292 https://www.ettoday.net/news/20190928/1543997.htm \n", "4284 4293 https://www.epochtimes.com/b5/19/6/13/n1132041... \n", "4285 4294 https://www.epochtimes.com/gb/19/6/13/n1132041... \n", "\n", " news_content news_day \\\n", "0 南加地產火熱,但建材漲,人工貴,房價順勢上揚。(記者李雪/攝影)\\n\\n新冠疫情 掀起一場風... None \n", "1 水泥為所有建設、工程主要原料,又是大宗民生物資之一,在水泥價格預期調漲下,市場指出,水泥漲價... 2021-05-05 04:10:00+ \n", "2 舉報\\n\\n疫情爆發以來,本材價格急升超過三倍。 CBC\\n\\n【星島綜合報道】北美建材價格... 2021-05-19 00:00:00 \n", "3 房地產業做為台灣經濟的火車頭,從土地、建物、建材、家具乃至設計裝潢產業等,市場皆不容小覷。2... 2021-04-30 00:00:00 \n", "4 同意 AGREE\\n\\n如果您繼續閱讀,視同您同意我們隱私條款。This website u... 2021-05-22 10:02:11+ \n", "... ... ... \n", "4281 黃之揚先前做過許多工作,卻都難以持久,怎知一碰見海洋垃圾,就成了他終身志業的所在。\\n\\n來... 2019-11-08 11:38:20+ \n", "4282 獎!獎!獎!清景麟「白易居THE ARCH」 獲三項國際設計獎\\n\\n撰文.攝影/張世雅\\n... None \n", "4283 文/時尚家居 空間設計暨圖片提供/東之光室內裝修設計\\n\\n\\n\\n《2019艾特獎 最佳辦... 2019-09-28 15:00:00 \n", "4284 室內裝修糾紛多?專家指點5步驟 避開陷阱不吃虧\\n\\n【大紀元2019年06月13日訊】(大... 2019-06-17 19:59:11+ \n", "4285 室内装修纠纷多?专家指点5步骤 避开陷阱不吃亏\\n\\n【大纪元2019年06月13日讯】(大... 2019-06-17 19:59:11+ \n", "\n", " crawler_date \n", "0 2021年05月23日 \n", "1 2021年05月23日 \n", "2 2021年05月23日 \n", "3 2021年05月23日 \n", "4 2021年05月23日 \n", "... ... 
\n", "4281 2021年06月13日 \n", "4282 2021年06月13日 \n", "4283 2021年06月13日 \n", "4284 2021年06月13日 \n", "4285 2021年06月13日 \n", "\n", "[4286 rows x 5 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "db = dataset.connect('mysql://choozmo:pAssw0rd@db.ptt.cx:3306/hhh?charset=utf8mb4')\n", "result = db.query('SELECT * FROM gnews.gnews_detail ')\n", "data = pd.DataFrame(result, columns=next(iter(result)).keys())\n", "data" ] }, { "cell_type": "code", "execution_count": 3, "id": "e6458138", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "id 0\n", "news_url 0\n", "news_content 0\n", "news_day 0\n", "crawler_date 0\n", "dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isna().sum()" ] }, { "cell_type": "code", "execution_count": 4, "id": "8c16f7da", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>id</th><th>news_url</th><th>news_content</th><th>news_day</th><th>crawler_date</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>2</td><td>https://www.worldjournal.com/wj/story/121362/5...</td><td>南加地產火熱,但建材漲,人工貴,房價順勢上揚。(記者李雪/攝影)\\n\\n新冠疫情 掀起一場風...</td><td>None</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>1</th><td>3</td><td>https://www.chinatimes.com/newspapers/20210505...</td><td>水泥為所有建設、工程主要原料,又是大宗民生物資之一,在水泥價格預期調漲下,市場指出,水泥漲價...</td><td>2021-05-05 04:10:00+</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>2</th><td>4</td><td>https://www.singtao.ca/4950752/2021-05-19/news...</td><td>舉報\\n\\n疫情爆發以來,本材價格急升超過三倍。 CBC\\n\\n【星島綜合報道】北美建材價格...</td><td>2021-05-19 00:00:00</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>3</th><td>5</td><td>https://tw.appledaily.com/property/20210430/LE...</td><td>房地產業做為台灣經濟的火車頭,從土地、建物、建材、家具乃至設計裝潢產業等,市場皆不容小覷。2...</td><td>2021-04-30 00:00:00</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>4</th><td>6</td><td>https://news.sina.com.tw/article/20210522/3864...</td><td>同意 AGREE\\n\\n如果您繼續閱讀,視同您同意我們隱私條款。This website u...</td><td>2021-05-22 10:02:11+</td><td>2021年05月23日</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>4281</th><td>4290</td><td>https://www.businesstoday.com.tw/article/categ...</td><td>黃之揚先前做過許多工作,卻都難以持久,怎知一碰見海洋垃圾,就成了他終身志業的所在。\\n\\n來...</td><td>2019-11-08 11:38:20+</td><td>2021年06月13日</td></tr>\n",
"    <tr><th>4282</th><td>4291</td><td>https://house.udn.com/house/story/11134/4221093</td><td>獎!獎!獎!清景麟「白易居THE ARCH」 獲三項國際設計獎\\n\\n撰文.攝影/張世雅\\n...</td><td>None</td><td>2021年06月13日</td></tr>\n",
"    <tr><th>4283</th><td>4292</td><td>https://www.ettoday.net/news/20190928/1543997.htm</td><td>文/時尚家居 空間設計暨圖片提供/東之光室內裝修設計\\n\\n\\n\\n《2019艾特獎 最佳辦...</td><td>2019-09-28 15:00:00</td><td>2021年06月13日</td></tr>\n",
"    <tr><th>4284</th><td>4293</td><td>https://www.epochtimes.com/b5/19/6/13/n1132041...</td><td>室內裝修糾紛多?專家指點5步驟 避開陷阱不吃虧\\n\\n【大紀元2019年06月13日訊】(大...</td><td>2019-06-17 19:59:11+</td><td>2021年06月13日</td></tr>\n",
"    <tr><th>4285</th><td>4294</td><td>https://www.epochtimes.com/gb/19/6/13/n1132041...</td><td>室内装修纠纷多?专家指点5步骤 避开陷阱不吃亏\\n\\n【大纪元2019年06月13日讯】(大...</td><td>2019-06-17 19:59:11+</td><td>2021年06月13日</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>3678 rows × 5 columns</p>\n",
"</div>
" ], "text/plain": [ " id news_url \\\n", "0 2 https://www.worldjournal.com/wj/story/121362/5... \n", "1 3 https://www.chinatimes.com/newspapers/20210505... \n", "2 4 https://www.singtao.ca/4950752/2021-05-19/news... \n", "3 5 https://tw.appledaily.com/property/20210430/LE... \n", "4 6 https://news.sina.com.tw/article/20210522/3864... \n", "... ... ... \n", "4281 4290 https://www.businesstoday.com.tw/article/categ... \n", "4282 4291 https://house.udn.com/house/story/11134/4221093 \n", "4283 4292 https://www.ettoday.net/news/20190928/1543997.htm \n", "4284 4293 https://www.epochtimes.com/b5/19/6/13/n1132041... \n", "4285 4294 https://www.epochtimes.com/gb/19/6/13/n1132041... \n", "\n", " news_content news_day \\\n", "0 南加地產火熱,但建材漲,人工貴,房價順勢上揚。(記者李雪/攝影)\\n\\n新冠疫情 掀起一場風... None \n", "1 水泥為所有建設、工程主要原料,又是大宗民生物資之一,在水泥價格預期調漲下,市場指出,水泥漲價... 2021-05-05 04:10:00+ \n", "2 舉報\\n\\n疫情爆發以來,本材價格急升超過三倍。 CBC\\n\\n【星島綜合報道】北美建材價格... 2021-05-19 00:00:00 \n", "3 房地產業做為台灣經濟的火車頭,從土地、建物、建材、家具乃至設計裝潢產業等,市場皆不容小覷。2... 2021-04-30 00:00:00 \n", "4 同意 AGREE\\n\\n如果您繼續閱讀,視同您同意我們隱私條款。This website u... 2021-05-22 10:02:11+ \n", "... ... ... \n", "4281 黃之揚先前做過許多工作,卻都難以持久,怎知一碰見海洋垃圾,就成了他終身志業的所在。\\n\\n來... 2019-11-08 11:38:20+ \n", "4282 獎!獎!獎!清景麟「白易居THE ARCH」 獲三項國際設計獎\\n\\n撰文.攝影/張世雅\\n... None \n", "4283 文/時尚家居 空間設計暨圖片提供/東之光室內裝修設計\\n\\n\\n\\n《2019艾特獎 最佳辦... 2019-09-28 15:00:00 \n", "4284 室內裝修糾紛多?專家指點5步驟 避開陷阱不吃虧\\n\\n【大紀元2019年06月13日訊】(大... 2019-06-17 19:59:11+ \n", "4285 室内装修纠纷多?专家指点5步骤 避开陷阱不吃亏\\n\\n【大纪元2019年06月13日讯】(大... 2019-06-17 19:59:11+ \n", "\n", " crawler_date \n", "0 2021年05月23日 \n", "1 2021年05月23日 \n", "2 2021年05月23日 \n", "3 2021年05月23日 \n", "4 2021年05月23日 \n", "... ... 
\n", "4281 2021年06月13日 \n", "4282 2021年06月13日 \n", "4283 2021年06月13日 \n", "4284 2021年06月13日 \n", "4285 2021年06月13日 \n", "\n", "[3678 rows x 5 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = data.drop_duplicates(subset=['news_content'], keep='first').drop_duplicates(subset=['news_url'], keep='first')\n", "data" ] }, { "cell_type": "code", "execution_count": 5, "id": "928c2993", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Building prefix dict from the default dictionary ...\n", "Loading model from cache /tmp/jieba.cache\n", "Loading model cost 0.404 seconds.\n", "Prefix dict has been built successfully.\n" ] } ], "source": [
"import json\n",
"import operator\n",
"from typing import List, Tuple, Optional\n",
"import os\n",
"import jieba\n",
"import jieba.posseg as pseg\n",
"\n",
"jieba.load_userdict('jieba_add_word.txt')\n",
"jieba.load_userdict('jieba_add_word_kw_with_weighting.txt')\n",
"\n",
"\n",
"check_pos_list=[]\n",
"\n",
"# Return True only if the string contains no ASCII letters and no digits\n",
"def notNumStr(instr):\n",
"    for item in instr:\n",
"        if '\\u0041' <= item <= '\\u005a' or ('\\u0061' <= item <='\\u007a') or item.isdigit():\n",
"            return False\n",
"    return True\n",
"\n",
"# Read a test case from a JSON file and concatenate its text fields\n",
"def readSingleTestCases(testFile):\n",
"    with open(testFile) as json_data:\n",
"        try:\n",
"            testData = json.load(json_data)\n",
"        except ValueError:\n",
"            # This try block deals with incorrect json format that has ' instead of \"\n",
"            json_data.seek(0)  # rewind: json.load has already consumed the file\n",
"            data = json_data.read().replace(\"'\",'\"')\n",
"            try:\n",
"                testData = json.loads(data)\n",
"            # This try block deals with empty transcript file\n",
"            except ValueError:\n",
"                return \"\"\n",
"    returnString = \"\"\n",
"    for item in testData:\n",
"        try:\n",
"            returnString += item['text']\n",
"        except KeyError:\n",
"            returnString += item['statement']\n",
"    return returnString\n",
"\n",
"class Word():\n",
"    def __init__(self, char, freq = 0, deg = 0):\n",
"        self.freq = freq\n",
"        self.deg = deg\n",
"        self.char = char\n",
"\n",
"    def returnScore(self):\n",
"        return self.deg/self.freq\n",
"\n",
"    def updateOccur(self, phraseLength):\n",
"        self.freq += 1\n",
"        self.deg += phraseLength\n",
"\n",
"    def getChar(self):\n",
"        return self.char\n",
"\n",
"    def updateFreq(self):\n",
"        self.freq += 1\n",
"\n",
"    def getFreq(self):\n",
"        return self.freq\n",
"\n",
"class Rake:\n",
"\n",
"    def __init__(self): # , stopwordPath: str = None, delimWordPath: str = None):\n",
"        # True once both the stop-word and delimiter lists are loaded\n",
"        self.initialized = False\n",
"        self.stopWordList = list()\n",
"        self.delimWordList = list()\n",
"\n",
"    def initializeFromPath(self, stopwordPath: str = \"\", delimWordPath: str = \"\"):\n",
"        if not os.path.exists(stopwordPath):\n",
"            print(\"Stop Word Path invalid\")\n",
"            return\n",
"\n",
"        if not os.path.exists(delimWordPath):\n",
"            print(\"Delim Word Path Invalid\")\n",
"            return\n",
"\n",
"        # NOTE: relies on the global OpenCC `converter` created in a later cell\n",
"        swLibList = [line.rstrip('\\n') for line in converter.convert(open(stopwordPath,'r').read()).split('\\n')]\n",
"        conjLibList = [line.rstrip('\\n') for line in converter.convert(open(delimWordPath,'r').read()).split('\\n')]\n",
"\n",
"        self.initializeFromList(swLibList, conjLibList)\n",
"        return\n",
"\n",
"    def initializeFromList(self, swList : List = None, dwList : List = None):\n",
"        # Fall back to empty lists so the len() checks below cannot fail on None\n",
"        self.stopWordList = swList or []\n",
"        self.delimWordList = dwList or []\n",
"\n",
"        if len(self.stopWordList) == 0 or len(self.delimWordList) == 0:\n",
"            print(\"Empty Stop word list or deliminator word list, uninitialized\")\n",
"            return\n",
"        else:\n",
"            self.initialized = True\n",
"\n",
"    def extractKeywordFromPath(self, text : str, num_kw : int = 10):\n",
"        if not self.initialized:\n",
"            print(\"Not initialized\")\n",
"            return\n",
"\n",
"        with open(text,'r') as fp:\n",
"            text = fp.read()\n",
"        return self.extractKeywordFromString(text, num_kw = num_kw)\n",
"\n",
"    def extractKeywordFromString(self, text : str, num_kw : int = 10):\n",
"        rawtextList = pseg.cut(text)\n",
"\n",
"        # Construct List of Phrases and Preliminary textList\n",
"        textList = []\n",
"        listofSingleWord = dict()\n",
"        lastWord = ''\n",
"\n",
"        # jieba POS tags that are treated as phrase delimiters\n",
"        poSPrty = ['zg','m','x','uj','ul','mq','u','v','f','t','vd','q','r','d','p','nr','r',\n",
"                   'c','TIME','xc','a','ad','an','nrt','df','b','vn','l','y','o','i']\n",
"\n",
"        meaningfulCount = 0\n",
"        checklist = []\n",
"        for eachWord, flag in rawtextList:\n",
"\n",
"            check_pos_list.append(str(eachWord)+'/'+str(flag))\n",
"\n",
"            checklist.append([eachWord,flag])\n",
"            if eachWord in self.delimWordList or not notNumStr(eachWord) or eachWord in self.stopWordList or flag in poSPrty or eachWord == '\\n':\n",
"                if lastWord != '|':\n",
"                    textList.append(\"|\")\n",
"                    lastWord = \"|\"\n",
"            elif eachWord not in self.stopWordList and eachWord != '\\n':\n",
"                textList.append(eachWord)\n",
"                meaningfulCount += 1\n",
"                if eachWord not in listofSingleWord:\n",
"                    listofSingleWord[eachWord] = Word(eachWord)\n",
"                lastWord = ''\n",
"\n",
"        # Construct a list of phrases, each phrase being a list of words\n",
"        newList = []\n",
"        tempList = []\n",
"        for everyWord in textList:\n",
"            if everyWord != '|':\n",
"                tempList.append(everyWord)\n",
"            else:\n",
"                newList.append(tempList)\n",
"                tempList = []\n",
"\n",
"        tempStr = ''\n",
"        for everyWord in textList:\n",
"            if everyWord != '|':\n",
"                tempStr += everyWord + '|'\n",
"            else:\n",
"                if tempStr[:-1] not in listofSingleWord:\n",
"                    listofSingleWord[tempStr[:-1]] = Word(tempStr[:-1])\n",
"                tempStr = ''\n",
"\n",
"        # Update the entire List\n",
"        for everyPhrase in newList:\n",
"            res = ''\n",
"            for everyWord in everyPhrase:\n",
"                listofSingleWord[everyWord].updateOccur(len(everyPhrase))\n",
"                res += everyWord + '|'\n",
"            phraseKey = res[:-1]\n",
"            if phraseKey not in listofSingleWord:\n",
"                listofSingleWord[phraseKey] = Word(phraseKey)\n",
"            else:\n",
"                listofSingleWord[phraseKey].updateFreq()\n",
"\n",
"        # Get score for entire Set\n",
"        outputList = dict()\n",
"        for everyPhrase in newList:\n",
"\n",
"            if len(everyPhrase) > 5:\n",
"                continue\n",
"            score = 0\n",
"            phraseString = ''\n",
"            outStr = ''\n",
"            for everyWord in everyPhrase:\n",
"                score += listofSingleWord[everyWord].returnScore()\n",
"                phraseString += everyWord + '|'\n",
"                outStr += everyWord\n",
"            phraseKey = phraseString[:-1]\n",
"            freq = listofSingleWord[phraseKey].getFreq()\n",
"\n",
"            if freq / meaningfulCount < 0.05 and freq < 3 :\n",
"                continue\n",
"            outputList[outStr] = score\n",
"\n",
"        return list(outputList.keys())" ] }, { "cell_type": "code", "execution_count": 6, "id": "cb3b8175", "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Stop Word Path invalid\n", "['', '從而', '同時', '論', '俺們', '一', '但因為', '只要', '而且', '彰攝', '兩者', '育攝', '是', '正是', '塵頭', '如若', '基於', '地將', '遵循', '資料', '啷噹', '呵', '方所', '尺寸', 'en', '但凡', '毋寧', '之', '重置', 'few', '一則', '發現', '2', '只為', '待', '來', '乃至', 'mustn', '礦山', '價落', 'on', '來着', '打造', '按照', 'jpg', '即令', '腳酸', '樣樣', '此外', '地產行業極具', '房大', '可能', '周知', '圖片', '大面', '1', '固然', 'use', '隨着', '房明明', '再者說', '他人', '最具', '怎麼樣', 'only', '致力', '感會', '首先', 'make', '一切', '至若', '較之', '異境', '妹妹', '有時', '如上', '但以', '上下', '當然', 'should', '云爾', '戰果', '靈魂', '任憑', 'me', '產品', '假若', '編號', '空間', '在於', '息息', '影響', '屋況整理乾', '寧願', '無所畏懼', '光是', '類科', '小心', '這就是說', '筆額', '門易關', '並非', '要務', '到', '此時', '以來', '替', '啦', '能', 'video', '{', 'as', '這麼些', '猶且', '依', '要不是', 'other', 'shan', '文中', '身上', '事情', 'but', '乘', '「', '事宜', '撰文編輯', '計師', 'Courtesy', '簡言之', '別以', '地上', '一不注意', 'will', '類如', '便於', '非特', '正如', 'one', 'at', '心裁', '如何', '有些', 'were', '太苦', '如下', '人', '及', '她', '玩意', '性化', '直到', '所在', '那麼', '可見', 'is', '唯有', '今', 'he', '但其', '坪房子實際', '鑑於', '聞雲', '男子', '前者', '具優良風評', '越是', '對', '誠如', '私性', '等', 'years', '公室', '記者林和謙', '先量', '子境', '萬一', '作者', '於小宅', '舉例', '什麼樣', '繼而', '誰', '何時', '嗡嗡', '莫若', 'hadn', '不過', '意願', '運氣', '我們', 'ma', '很', '大量', '由於', '其', '身試法', 'with', '衣空間', '房會', '歸齊', '刀口', 'isn', '內所', '文章曝光後',
'用心', '這一來', '或則', '建議', '中正', \"haven't\", '產品時', '直指', '地', '與此同時', '師溝', '豈但', '若是', '騙人', '借', 'do', '們', '諸', '紀錄', '產資訊平台', '這麼樣', '而外', '彼此', 'she', '”', '嚇', '事項', '據', \"wouldn't\", '仔細比', '甚至', '成房貸', '一樣', 'herself', '繼後', '住房地租', 'weren', '記錄', '魔法', '但', '惟其', '難題', '有形', '其一', 'which', '以免', '!', '截至', '台灣', '目前', '無人', '連同', '巴巴', '一般', '除了', 'when', '前此', '經過', '$', '業人員', 'that', 'didn', '總而言之', '這樣', '人居', '但怕視', '道理', '此次', '金代', '不合理', '在下', '嘍', '用法', '矣乎', '的確', '這邊', 'for', '誰人', '設使', 'same', '兒', '那會兒', '能力', 'no', '細心', '建案總', '部份', '諸如', '若夫', '向着', '設若', '盡', '事物', 'are', '別', '介於', '諸位', 'way', 'did', '遵照', \"doesn't\", 'nor', '常會', '主力', '另悉', '之所', '臨', '哉', '獎入', '以', '儘管', '衣機', '設或', \"shan't\", '又及', '拿', '甚至於', \"hadn't\", '能容', '怎麼', '爲止', '哪天', '誤會', '意義', 'we', '差距', '非但', '嗚', '看', '公司名稱', \"shouldn't\", '離', \"weren't\", 'until', '風水上', '倘使', \"she's\", '邊', '根據', 'shouldn', '致', '能溝', '各種', '他們', '每', '會比', '數範圍', '哪邊', '帝國', 'was', '嘻', '處在', '咳', '不如', '自打', '然後', '並說', '花錢', '此間', '譬如', '常態', '自個兒', ',', '規業者', '即如', '只怕', '比如', 'below', '一旦', '怎', '喂', '故此', '也好', '記者', '不惟', '這時', '熱議', '烏乎', 'through', '從此', '中路', '之所以', '由此可見', '最', '般的', '於新穎', '師盧', '接着', '矣哉', \"won't\", '邊機', '甚而', '爲了', '式廚房', '本身', '省地', '第一', '開始', '只限', '國中美術課', 'yours', '傻傻的', '不外乎', '寧可', '戚戚焉', '_', 'Trophy', '哈哈', '各自', '記者許凱彰攝', '先裁', '則', '間界線', 're', 'own', '誰料', 'by', '奇葩', '而為', '本站', '質與量', '無寧', '它們', '嘿', '其它', '難道說', '本心', '或過', '保值性因機能', '結果', '果真', '僵持不下', '能點', \"isn't\", 'who', '\\n', '於是乎', '(。', '天生', '母法', 'here', '由是', '幽靈', 'there', '總的來看', '不妨', '後', '極了', '而言', '呼哧', '然而', '獲得', '反過來說', '時會', '歟', '使', '才', '手筆', '不盡', '各位', '報導', '版主', '這些', '風傳媒節', '按', '所', '內', '六大', '我', '喲', '企業代', '加以', '奇奇怪怪', '不肖', '台北報導', 'than', '使得', '》', 'themselves', '又', 'of', 'yourselves', '重重的', '打', '衝', '爲此', 'projects', '盡有', '所幸', '本', '屋會', '別處', '老實', '身心俱疲', '因坪數小', 'image', 'down', '云云', '距', '再者', '、', 'just', 'i', '哦', '一些', 
'讓', 'new', '不只', '無法', '因素', 'or', '要是', '叫', '數算', '時候', '啐', '絕佳', '因為房價', '樓半', \"mightn't\", '自後', '意念', '回家', '只', '各', '屋裡', '本人', '並以', '事交', '2019', '方式', '將房門關', '多麼', '不容', '不光', '客買盤', '以故', 'name', '跟', '非徒', '呀', '沿', '名稱', '東販', '賊死', '除開', '則甚', 'ours', '什麼', '誠然', '人群', '旺好', '單者', '要不', 'under', '巴', '人點', '正派', '咚', '譁', '心中', '他', \"hasn't\", '靠山', 'won', '對於', '唉', '因為床', '不特', 'into', '大', '。', '據此', '人性所需', '再其次', 't', '寬境', '就是', '之類', 'category', '比及', '嗡', '嘎', '等等', '哪些', '呢', '既往', '哼', '因了', '望', 'and', '內政部部務會報', '至今', \"you'd\", '雖', '且說', '這個', '人家', '與', '區內湖路', '獵心', '哎呀', 'ourselves', '字頭', \"couldn't\", '億元', '些', '筆錢', '此', '不然', '一個', '將客', '即便', '小包', '雖然', ':', '人員', '內心', '房前', '管', '起亞', '不怕', 'hers', '以為', 'any', '二是', '沿着', '上會', '何處', '所有', '於是', '形式交屋', '區域分', '順着', '即或', '如', '難事', '縱使', '不論是', '今年', '作品參賽', '靜靜的', '屋魔鬼系解', '哥哥', '先求', '6', '區處', '哎喲', '其次', 'after', '高大', '其中', '關於', 'haven', '如同', '出來', '一格', '買幾', '或', '林喬慧', '省得', '幸福', '同', '甜頭', '哪兒', '悠然', '記者嚴鈺雯', '再說', '並', '定義', '矣', '記者張菱育攝', 'my', 'off', 'all', '或是', '就是說', '雲司', '來定義', '記者蘇瑜', '嗬', '分別', '高度喔', '者', '以期', '那樣', '爾後', 'whom', '俺', '此處', 'during', '原本', '不盡然', '不成', '公分', '且', '新高', '果然', '這', '甚或', '針對', 'your', 'their', 'them', '開外', '目光', '精彩', '相對而言', '標題', '當', '那麼樣', '除此之外', '記者林裕豐', 'not', 'itself', 'each', '煥然', 'this', '記者許凱彰', '別說', '數介', '人字', '的話', '記者戴鈺純', '精力', '麼', '賴以', '股份', '不得', '己', '\\n\\n', '界權威', '記者蘇', '%', '隨', '嗯', '因此', '房族', \"aren't\", '因人', '師建議', '由此', '一物', 'himself', '表示', '被', '既', '而後', '當地', '趕', '哪年', '自', '大門尿尿', '那', '不是', '例如', '本地', '那邊', '大台', '沖馬', '從', '有及', '爲', 'some', '房則', '怪物', '兼之', '不拘', 'having', '近百', '得', '客才', '只不過', '此地', 'its', '庶乎', '問題', '你', '即若', '屋人房貸', '對方', '該', 'theirs', '咱', '何況', 'while', 'has', '是的', '但是', '在', \"should've\", '成日', '指出', '照着', \"didn't\", '儻然', '大大的', 'you', '替代', '隨後', '總的說來', '器應裝', '字頭熱門宅', '可區', '儘管如此', '四大', '然則', '身邊朋友', '.', '但要', '着', '坪格局', '仍', 
'couldn', '爾爾', '人熱心', '易度', '其他人', '還', '公司', '數產品', '層樓華', '因', 'our', '專門', '孰料', '性差', '樣子', '依舍', '若果 ', '或者', '人會', '網吐槽', '本着', '?', '提供', 's', '精神', 'against', '雖以', '一來', '比方', 'if', '庶幾', '費超', '维境', '性比', '哎', '費越', 'very', '與其說', '何以', '各個', '咧', '騰', '也', '哪怕', 'd', '怎麼辦', '可是', '那個', '時格局', '數動輒', '慘狀', '濕機', '起', 'the', 'what', '3', '阿', '手邊', '情況', '前後', '已矣', 'out', '靠', '對待', '坪上', '人士', '以便', '都', '漫說', '或其', '感覺', '正巧', '它', 'his', '哪', '因着', '身分', '受到', '前務', '縱令', 'to', '倘然', '綜上所述', '著文具', '反觀', '力度', 'aren', \"mustn't\", '屬量', '好', \"you'll\", '不比', '另外', 'where', '也罷', 'wouldn', '意趣', '有', '只當', '拉大', '土生土長', 'does', '全部', '以及', '和平', '記者呂詠', '頻軍', '說來', '叮咚', '只消', '吱', 'because', '格格不入', '飛彈', '某', '內文', '例子', '能爭', '作品', '力道', '以銀行匯款', '通過', '正值', '但若買家', '莫不然', '《', 'ain', '假使', 'how', '容易', '2020', '上會感覺', '以上', '緊接着', '許多', '而是', '由', '照度', ' ', '發展', '既是', '的', '小小的', 'doing', '所國', '哪個', '朝着', '或以', '具體地說', '二來', 'in', '可及', '並且', '爲什麼', '項缺點', '鄙人', '如是', '舊率', '房改', '乎', '慢說', '既有', 'more', 'an', '不論', '焉', '識度', '了', '全世界', '數約', '有所', '至於', '因為', 'so', '需求', '本質', '誰知', '凡是', '太大', '建議民眾', '去年', 'those', '呸', '賞心悅目', 'time', '}', 'doesn', '腦筋', '別的', '工班溝', '而已', '呃', '如其', 'him', '逐步', '竟而', '式更衣室', '甚且', '屋民眾', '彼時', '縱', 'says', '即使', '不獨', '亦', '上百', '4', '加之', '歸', '因而', '可說', '那般', '隨時', '乾脆', '過類', '寧', '越小越', '越大越', '筆買房', '且不說', '-', \"needn't\", '哇', '這次', '這會兒', 'be', '可', '海陸', '無論', 'further', '啊', '幾', '但若', '如上所述', '着呢', 'o', '」', '東西', '每當', '體上', '那兒', '爲着', 'm', '居者', '嘿嘿', '記者黃可昀', '與否', '科技大學財務金融系副教授', '換句話說', '形家具', ')', '商願意', '連', 'about', '反而', 'also', 'once', '中文版', '部分', '因應', '但僅', '內能', '哪裏', '不若', '咦', '換言之', '若以', '打從', '吧噠', '以此', '喏', '會身', '天下', 've', '若非', '若', 'before', '凡', '怎奈', '似的', '飾性', '曾耳聞', '當着', '意境', '心思', '人礙', '尚且', '至', '而', '嘛', '寧肯', '猶自', '給', '這位網友文中', 'over', '經由', \"you're\", '有質', '大腿', 'hasn', 'space', '將光', '數小', '不但', '透過', '情景', '別人', '假如', '和', '經', '人建議', 'now', '某某', 
'以至於', '自身', '傳媒', '雞毛', '成果', 'between', 'most', '自家', '這麼', '縱然', 'next', '那麼些', 'a', '照', '太多', '區分', '非獨', '自各兒', '嗚呼', '那些', '其二', 'myself', '再', '買房時', '動人', '卻', 'from', '是以', '江怡慧', '旁人', '需拉明線', '心目', '?', '手中', '師能依消費者習慣', 'am', '屋人', '所以', '除', \"you've\", '將', '先行', '哩', '不翼而飛', '要略', 'up', '不至於', '價是', '以至', '爲何', '人化', '不問', '往', '爾', '繼之', '比創', '依照', '罷了', '第', '只是', '雖則', ')。', '家行', '仍舊', '個別', '還要', '再有', '冒', '這般', '泥代', '出於', '這裏', '厚積', '辦法', '幹嘛', '世人', '套衛浴', '端的', '比', '一方面', '笨蛋', '不卡', '趁', '喔唷', '能否', \"that'll\", '三房實質', '泡泡大件衣物', '等到', 'too', '自從', '筆者', '憑藉', '個', '報報', 'it', '先不先', '貴公', '(', '否則', 'well', '烤雞', '別是', '產生', '成川建商新案時', '進而', '對比', '那裏', '啪達', '人能', '若為', '法親', '並同時', '你們', '還是', '林明', '主要', '文曝光', '來說', '反過來', '另', '有的', '才能', '始而', '於坪數', '已', 'then', '式空間', '理事', '萬元', '故而', '那時', '因爲', 'both', '情勢', '後者', '不管', '順', '餘外', '依據', '咱們', '但小坪', '方為', 'er', '投機客會', 'wasn', '多', '以爲', '雙方', '0', 'inline', '像', '奔騰瀟', \"it's\", '全體', '東南亞', '向使', '要', '作爲', '之一', '數字', 'don', '總的來說', '不為人知', '沒奈何', '中散', '先裝潢', '倘或', '因為人', '或曰', '網友坦言', '加泥共', '會落', '故', '塞進', '吧', '套共用', '一空', '可以', '原因', '幾時', '10', '心態', '其他', 'best', '來自', '感興趣', 'these', '嘔', '不', '別管', '數比', '最佳', 'needn', '具體說來', '而有', '再則', '除非', \"don't\", \"wasn't\", '起見', 'y', '此為', '乃', '無', '於', '況且', '倘若', 'been', '去', '以致', '知域', '既然', '就算', '某個', '整體感覺', '很漂亮', '任', '會準', '何', '啥', '抑或', '一何', '譬喻', '就', '新房子手上', '網友熱議', '哈', '及其', '5', 'yourself', '向', '總之', '莫如', '就是了', '趁着', '利於', '使用', '兮', '呵呵', '“', '樓視野', '用光', '某些', '古有明', '憑', '自己', '曾', '下', '倘', '上', '答案', '多少', '以外', '恰恰相反', '實力由此', '不單', 'above', '任何', '下場', 'again', '與其', '路上', '心情', '哪樣', '7', 'why', '到哪去', '有限', '噓', '只有', '嗎', '較', '把', '9', '競相', '除外', 'being', 'can', 'like', '/', '天上', '如此', 'have', '眼見', '眼睛', '幻想篇章', 'such', '有關', '人們', '及至', '大家', '就要', '若要', '終於', '房太', 'see', '依法', '音因', '總是', '因感', '售屋時', 'had', '朝', '如果', '過', '不僅', '持續', '孰知', '充份', ';', '其餘', '有限公司', '會花', 
'價王', 'year', '人心', '造型天花', '題發文', 'mightn', '用', 'they', '雖說', '連帶', '一轉眼', '\\n\\n◎', '要麼', '房後', '另一方面', '她們', '大公', '氣憤', '賓客目光', '乃至於', '不料', '無虞', '哼唷', '關卡入', '眨眼', 'her', '性需求', '藏金', '8', '怎樣', '文本', '用來', '四大原因', '而況', '反之', 'll', '得了', '甚麼', '領教', '要不然', '小化', '數業者', '這麼點兒', '唄', '這兒', '即', '咋', '您', '小', '內容', '噯', '寓所', '還有', '記者陳韋帆攝', '彼', '步入', '人買房', '嘎登', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\", '$', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '?', '_', '“', '”', '、', '。', '《', '》', '一', 
'一些', '一何', '一切', '一則', '一方面', '一旦', '一來', '一樣', '一般', '一轉眼', '萬一', '上', '上下', '下', '不', '不僅', '不但', '不光', '不單', '不只', '不外乎', '不如', '不妨', '不盡', '不盡然', '不得', '不怕', '不惟', '不成', '不拘', '不料', '不是', '不比', '不然', '不特', '不獨', '不管', '不至於', '不若', '不論', '不過', '不問', '與', '與其', '與其說', '與否', '與此同時', '且', '且不說', '且說', '兩者', '個', '個別', '臨', '爲', '爲了', '爲什麼', '爲何', '爲止', '爲此', '爲着', '乃', '乃至', '乃至於', '麼', '之', '之一', '之所以', '之類', '烏乎', '乎', '乘', '也', '也好', '也罷', '了', '二來', '於', '於是', '於是乎', '云云', '云爾', '些', '亦', '人', '人們', '人家', '什麼', '什麼樣', '今', '介於', '仍', '仍舊', '從', '從此', '從而', '他', '他人', '他們', '以', '以上', '以爲', '以便', '以免', '以及', '以故', '以期', '以來', '以至', '以至於', '以致', '們', '任', '任何', '任憑', '似的', '但', '但凡', '但是', '何', '何以', '何況', '何處', '何時', '餘外', '作爲', '你', '你們', '使', '使得', '例如', '依', '依據', '依照', '便於', '俺', '俺們', '倘', '倘使', '倘或', '倘然', '倘若', '借', '假使', '假如', '假若', '儻然', '像', '兒', '先不先', '光是', '全體', '全部', '兮', '關於', '其', '其一', '其中', '其二', '其他', '其餘', '其它', '其次', '具體地說', '具體說來', '兼之', '內', '再', '再其次', '再則', '再有', '再者', '再者說', '再說', '冒', '衝', '況且', '幾', '幾時', '凡', '凡是', '憑', '憑藉', '出於', '出來', '分別', '則', '則甚', '別', '別人', '別處', '別是', '別的', '別管', '別說', '到', '前後', '前此', '前者', '加之', '加以', '即', '即令', '即使', '即便', '即如', '即或', '即若', '卻', '去', '又', '又及', '及', '及其', '及至', '反之', '反而', '反過來', '反過來說', '受到', '另', '另一方面', '另外', '另悉', '只', '只當', '只怕', '只是', '只有', '只消', '只要', '只限', '叫', '叮咚', '可', '可以', '可是', '可見', '各', '各個', '各位', '各種', '各自', '同', '同時', '後', '後者', '向', '向使', '向着', '嚇', '嗎', '否則', '吧', '吧噠', '吱', '呀', '呃', '嘔', '唄', '嗚', '嗚呼', '呢', '呵', '呵呵', '呸', '呼哧', '咋', '和', '咚', '咦', '咧', '咱', '咱們', '咳', '哇', '哈', '哈哈', '哉', '哎', '哎呀', '哎喲', '譁', '喲', '哦', '哩', '哪', '哪個', '哪些', '哪兒', '哪天', '哪年', '哪怕', '哪樣', '哪邊', '哪裏', '哼', '哼唷', '唉', '唯有', '啊', '啐', '啥', '啦', '啪達', '啷噹', '喂', '喏', '喔唷', '嘍', '嗡', '嗡嗡', '嗬', '嗯', '噯', '嘎', '嘎登', '噓', '嘛', '嘻', '嘿', '嘿嘿', '因', '因爲', '因了', '因此', '因着', '因而', '固然', '在', '在下', '在於', '地', '基於', '處在', '多', '多麼', '多少', '大', '大家', '她', '她們', '好', '如', '如上', '如上所述', '如下', 
'如何', '如其', '如同', '如是', '如果', '如此', '如若', '始而', '孰料', '孰知', '寧', '寧可', '寧願', '寧肯', '它', '它們', '對', '對於', '對待', '對方', '對比', '將', '小', '爾', '爾後', '爾爾', '尚且', '就', '就是', '就是了', '就是說', '就算', '就要', '盡', '儘管', '儘管如此', '豈但', '己', '已', '已矣', '巴', '巴巴', '並', '並且', '並非', '庶乎', '庶幾', '開外', '開始', '歸', '歸齊', '當', '當地', '當然', '當着', '彼', '彼時', '彼此', '往', '待', '很', '得', '得了', '怎', '怎麼', '怎麼辦', '怎麼樣', '怎奈', '怎樣', '總之', '總的來看', '總的來說', '總的說來', '總而言之', '恰恰相反', '您', '惟其', '慢說', '我', '我們', '或', '或則', '或是', '或曰', '或者', '截至', '所', '所以', '所在', '所幸', '所有', '才', '才能', '打', '打從', '把', '抑或', '拿', '按', '按照', '換句話說', '換言之', '據', '據此', '接着', '故', '故此', '故而', '旁人', '無', '無寧', '無論', '既', '既往', '既是', '既然', '時候', '是', '是以', '是的', '曾', '替', '替代', '最', '有', '有些', '有關', '有及', '有時', '有的', '望', '朝', '朝着', '本', '本人', '本地', '本着', '本身', '來', '來着', '來自', '來說', '極了', '果然', '果真', '某', '某個', '某些', '某某', '根據', '歟', '正值', '正如', '正巧', '正是', '此', '此地', '此處', '此外', '此時', '此次', '此間', '毋寧', '每', '每當', '比', '比及', '比如', '比方', '沒奈何', '沿', '沿着', '漫說', '焉', '然則', '然後', '然而', '照', '照着', '猶且', '猶自', '甚且', '甚麼', '甚或', '甚而', '甚至', '甚至於', '用', '用來', '由', '由於', '由是', '由此', '由此可見', '的', '的確', '的話', '直到', '相對而言', '省得', '看', '眨眼', '着', '着呢', '矣', '矣乎', '矣哉', '離', '竟而', '第', '等', '等到', '等等', '簡言之', '管', '類如', '緊接着', '縱', '縱令', '縱使', '縱然', '經', '經過', '結果', '給', '繼之', '繼後', '繼而', '綜上所述', '罷了', '者', '而', '而且', '而況', '而後', '而外', '而已', '而是', '而言', '能', '能否', '騰', '自', '自個兒', '自從', '自各兒', '自後', '自家', '自己', '自打', '自身', '至', '至於', '至今', '至若', '致', '般的', '若', '若夫', '若是', '若果 ', '若非', '莫不然', '莫如', '莫若', '雖', '雖則', '雖然', '雖說', '被', '要', '要不', '要不是', '要不然', '要麼', '要是', '譬喻', '譬如', '讓', '許多', '論', '設使', '設或', '設若', '誠如', '誠然', '該', '說來', '諸', '諸位', '諸如', '誰', '誰人', '誰料', '誰知', '賊死', '賴以', '趕', '起', '起見', '趁', '趁着', '越是', '距', '跟', '較', '較之', '邊', '過', '還', '還是', '還有', '還要', '這', '這一來', '這個', '這麼', '這麼些', '這麼樣', '這麼點兒', '這些', '這會兒', '這兒', '這就是說', '這時', '這樣', '這次', '這般', '這邊', '這裏', '進而', '連', '連同', '逐步', '通過', '遵循', '遵照', '那', '那個', '那麼', '那麼些', 
'那麼樣', '那些', '那會兒', '那兒', '那時', '那樣', '那般', '那邊', '那裏', '都', '鄙人', '鑑於', '針對', '阿', '除', '除了', '除外', '除開', '除此之外', '除非', '隨', '隨後', '隨時', '隨着', '難道說', '非但', '非徒', '非特', '非獨', '靠', '順', '順着', '首先', '!', ',', ':', ';', '?', '']\n", "CPU times: user 322 ms, sys: 12.5 ms, total: 335 ms\n", "Wall time: 333 ms\n" ] } ], "source": [
"%%time\n",
"# Extract the keywords of every article with Rake_For_Chinese\n",
"\n",
"import random\n",
"\n",
"from nltk.corpus import stopwords\n",
"import time\n",
"import opencc\n",
"\n",
"converter = opencc.OpenCC('s2t.json')\n",
"converter.convert('汉字') # 漢字\n",
"cc_stopwords = converter.convert(open(\"cn_stopwords.txt\", \"r\").read()).split('\\n')\n",
"\n",
"\n",
"\n",
"obj = Rake()\n",
"stop_path = \"./stoplist/中文停用词表(1208个).txt\"\n",
"conj_path = \"./stoplist/中文分隔词词库.txt\"\n",
"obj.initializeFromPath(stop_path, conj_path)\n",
"\n",
"\n",
"\n",
"# Stopword list\n",
"with open('customized_stopwords.pickle', 'rb') as handle:\n",
"    customized_stopwords = pickle.load(handle)\n",
"customized_stopwords.extend(stopwords.words('english'))\n",
"customized_stopwords.extend(cc_stopwords)\n",
"\n",
"print(customized_stopwords)" ] }, { "cell_type": "code", "execution_count": 7, "id": "4461d197", "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "【Transformer】Documents embedding ... DONE!\n", "embeddings.shape: (3678, 512)\n", "CPU times: user 1min 6s, sys: 2.92 s, total: 1min 8s\n", "Wall time: 17.8 s\n" ] } ], "source": [
"%%time\n",
"from sentence_transformers import SentenceTransformer, util\n",
"from transformers import AutoTokenizer, AutoModel, BertTokenizerFast\n",
"import torch\n",
"import umap\n",
"from more_itertools import ichunked\n",
"\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"\n",
"\n",
"# documents embedding\n",
"print('【Transformer】Documents embedding ... 
',end='')\n", "model = SentenceTransformer('distiluse-base-multilingual-cased-v1')\n", "embeddings = model.encode(data['news_content'].tolist())\n", "print('DONE!')\n", "print('embeddings.shape:',embeddings.shape)" ] }, { "cell_type": "code", "execution_count": 8, "id": "b4f7c1cb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "【HDBSCAN】Clustering ...DONE!\n", " -------------------------------------------------------------------------------- \n", "Noise ratio: 90.3 % 3321 / 3678\n", "... 滿心歡喜住進去,不久卻帶著平面圖逃出來的屋主還真不少,房子住起來卻不順,你可知問題到底出在哪?準備裝修或下手買屋之前,建議先進行九項常見問題格局檢視,提前摸透房子的特性,避免住進頭痛屋,日後必須花大把銀子改格局。當然,如果能早發現、早改正,才能早點迎接幸福生活!01|我家的窗戶不受「風」的歡迎通風的重點在於對流,而非開越多窗越好。格局設計不光思考人的動線是否順暢,風的動線也很重要。風會尋找最小距離直線前進,它沒辦法像人一樣可以轉身,用同一個開窗就能同時進與出,要創造風的動線必須要同時兩個對外窗,一個做為入口、一個做為出口。其次,風行徑時會選擇最短的直線路線來走,如何判斷平面圖是否有良好通風路徑 ...\n", " --------------------------------------------------------------------------------\n", "Predict_doc (target domain) is predicted to be in cluster # -1 (noise)\n", " -> Replace with the largest cluster # 31\n" ] }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAACNoAAAEwCAYAAABrdo7OAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8rg+JYAAAACXBIWXMAAAsTAAALEwEAmpwYAAA0GklEQVR4nO3de5hkdXkn8O8rI+EiF4FhHBgRVIIaMCAThGhELiKgEUS8oDFoiCiSrNco6m6iZvdZ3WSTuFnjSkRDIoIIKHjhJoKAUchwCygaEEG5CMMdBQSG3/7RBQGc7imYrjpNnc/neXi66lQ1/X2f39Spt+u8fU611gIAAAAAAAAAAMzsCV0HAAAAAAAAAACAxwODNgAAAAAAAAAAMASDNgAAAAAAAAAAMASDNgAAAAAAAAAAMASDNgAAAAAAAAAAMASDNgAAAAAAAAAAMIShBm2q6l1V9f2quqSqjqyq1apqs6o6p6our6ovVtWqow4LAAAAAAAAAABdqdbazE+o2jjJ2Ume01q7q6qOTvKNJHsmOa61dlRV/b8kF7XWPjXT/2uDDTZom2666ewkBwAAAAAAAACAWXbeeefd2Fqbv7zH5g35/5iXZPWqujfJGkmuS7JzktcPHj88yYeTzDhos+mmm2bJkiVD/kgAAAAAAAAAABivqrpqusdWeOmo1to1Sf46yU8zNWBzW5LzktzaWrtv8LSrk2y88lEBAAAAAAAAAGBuWuGgTVU9OcleSTZLslGSNZPsPuwPqKoDq2pJVS1ZunTpYw4KAAAAAAAAAABdWuGgTZJdk/yktba0tXZvkuOSvCDJulX1wKWnFiW5Znnf3Fo7tLW2uLW2eP785V6+CgAAAAAAAAAA5rxhBm1+mmT7qlqjqirJLkl+kOT0JPsOnrN/kuNHExEAAAAAAAAAALq3wkGb1to5SY5Jcn6Siwffc2iS9yd5d1VdnmT9JIeNMCcAAAAAAAAAAHRq3oqfkrTW/iLJXzxi8xVJtpv1RAAAAAAAAAAAMAcNc+koAAAAAAAAAADoPYM2AAAAAAAAAAAwBIM2AAAAAAAAAAAwBIM2AAAAAAAAAAAwBIM2AAAAAAAAAAAwhHldBwAAAAAAAAAAhnDtBV0nGK2Ntuk6AayQM9oAAAAAAAAAAMAQDNoAAAAAAAAAAMAQDNoAAAAAAAAAAMAQ5nUdAAAAAAAAAADozo8uvzKvPeiQB+9f8dNr8tH3vi3fPe/f86MfX5UkufX2O7Lu2mvlwlOP6iomzAkGbQAAAAAAAACgx7Z45qYPDtAsW7YsG2+7e165x05551ve8OBz3vORv8k6az+pq4gwZ7h0FAAAAAAAAACQJDnt7HPzjKctytMWbfTgttZajv7qqdlvr907TAZzg0EbAAAAAAAAACBJctTxJ2e/vV/6sG1nnXN+FsxfL5s/fZOOUsHcYdAGAAAAAAAAAMg999ybE045M69++Usetv3Ir5zsbDYwMK/rAAAAAAAAAABA9048/Tt53lbPyoL56z+47b777stxJ34r5514RIfJYO5wRhsAAAAAAAAAIEd+5aRfu2zUN886J8965qZZtNGCjlLB3GLQBgAAAAAAAAB67pd33pVTzzwn++yx88O2H3X8KS4bBQ/h0lEAAAAAAAAA0HNrrrF6bvr+6b+2/Z/+7iMdpIG5yxltAAAAAAAAAABgCAZtAAAAAAAAAABgCAZtAAAAAAAAAABgCAZtAAAAAAAAAABgCAZtAAAAAAAAAABgCAZtAAAAAAAAAABgCAZtAAAAAAAAAABgCCsctKmqLarqwof8d3tVvbOq1quqU6vqssHXJ48jMAAAAAAAAAAAdGGFgzattR+11rZurW2dZNskdyb5cpJDkpzWWts8yWmD+wAAAAAAAAAAMJEe7aWjdkny49baVUn2SnL4YPvhSfaexVwAAAAAAAAAADCnPNpBm9clOXJwe0Fr7brB7Z8nWbC8b6iqA6tqSVUtWbp06WOMCQAAAAAAAAA
A3Rp60KaqVk3yiiRfeuRjrbWWpC3v+1prh7bWFrfWFs+fP/8xBwUAAAAAAAAAgC49mjPa7JHk/Nba9YP711fVwiQZfL1htsMBAAAAAAAAAMBc8WgGbfbLf142KklOSLL/4Pb+SY6frVAAAAAAAAAAADDXDDVoU1VrJnlJkuMesvljSV5SVZcl2XVwHwAAAAAAAAAAJtK8YZ7UWvtlkvUfse2mJLuMIhQAAAAAAAAAAMw1j+bSUQAAAAAAAAAA0FsGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAgGbQAAAAAAAAAAYAhDDdpU1bpVdUxV/bCqLq2qHapqvao6taouG3x98qjDAgAAAAAAAABAV4Y9o80nkpzUWntWkt9OcmmSQ5Kc1lrbPMlpg/sAAAAAAAAAADCRVjhoU1XrJHlRksOSpLV2T2vt1iR7JTl88LTDk+w9mogAAAAAAAAAANC9Yc5os1mSpUk+V1UXVNVnqmrNJAtaa9cNnvPzJAtGFRIAAAAAAAAAALo2zKDNvCTPS/Kp1to2SX6ZR1wmqrXWkrTlfXNVHVhVS6pqydKlS1c2LwAAAAAAAAAAdGKYQZurk1zdWjtncP+YTA3eXF9VC5Nk8PWG5X1za+3Q1tri1tri+fPnz0ZmAAAAAAAAAAAYuxUO2rTWfp7kZ1W1xWDTLkl+kOSEJPsPtu2f5PiRJAQAAAAAAAAAgDlg3pDP+9MkR1TVqkmuSPLmTA3pHF1VByS5KslrRhMRAAAAAAAAAAC6N9SgTWvtwiSLl/PQLrOaBgAAAAAAAAAA5qgVXjoKAAAAAAAAAAAwaAMAAAAAAAAAAEMxaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEOYN8yTqurKJHckWZbkvtba4qpaL8kXk2ya5Mokr2mt3TKamAAAAAAAAAAA0K1Hc0abnVprW7fWFg/uH5LktNba5klOG9wHAAAAAAAAAICJtDKXjtoryeGD24cn2Xul0wAAAAAAAAAAwBw17KBNS3JKVZ1XVQcOti1orV03uP3zJAtmPR0AAAAAAAAAAMwR84Z83gtba9dU1YZJTq2qHz70wdZaq6q2vG8cDOYcmCSbbLLJSoUFAAAAAAAAAICuDHVGm9baNYOvNyT5cpLtklxfVQuTZPD1hmm+99DW2uLW2uL58+fPTmoAAAAAAAAAABizFQ7aVNWaVbXWA7eT7JbkkiQnJNl/8LT9kxw/qpAAAAAAAAAAANC1YS4dtSDJl6vqged/obV2UlX9W5Kjq+qAJFclec3oYgIAAAAAAAAAQLdWOGjTWrsiyW8vZ/tNSXYZRSgAAAAAAAAAAJhrVnjpKAAAAAAAAAA
AwKANAAAAAAAAAAAMxaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMwaANAAAAAAAAAAAMYehBm6papaouqKqvDe5vVlXnVNXlVfXFqlp1dDEBAAAAAAAAAKBbj+aMNu9IculD7n88yd+21p6Z5JYkB8xmMAAAAAAAAAAAmEuGGrSpqkVJXpbkM4P7lWTnJMcMnnJ4kr1HkA8AAAAAAAAAAOaEYc9o83dJ3pfk/sH99ZPc2lq7b3D/6iQbL+8bq+rAqlpSVUuWLl26MlkBAAAAAAAAAKAzKxy0qaqXJ7mhtXbeY/kBrbVDW2uLW2uL58+f/1j+FwAAAAAAAAAA0Ll5QzznBUleUVV7JlktydpJPpFk3aqaNzirzaIk14wuJgAAAAAAAAAAdGuFZ7RprX2gtbaotbZpktcl+VZr7Q1JTk+y7+Bp+yc5fmQpAQAAAAAAAACgYysctJnB+5O8u6ouT7J+ksNmJxIAAAAAAAAAAMw9w1w66kGttTOSnDG4fUWS7WY/EgAAAAAAAAAAzD0rc0YbAAAAAAAAAADoDYM2AAAAAAAAAAAwBIM2AAAAAAAAAAAwBIM2AAAAAAAAAAAwhHldBwAAAAAAAIBRuvjq27qOMDJbLVqn6wgA0CvOaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEMwaAMAAAAAAAAAAEOY13UAAAAAAAAAeDz61d1358377pl77vlVli1bll33fEUOfs8Hs/8+e+TOX96RJLn5xhuz5dbPyycO+0LHaQGA2WDQBgAAAAAAAB6DVX/jN/KZL56QNdZ8Uu69997sv8/ueeFOL8nhx5344HPedeAbs9Nue3aYEgCYTS4dBQAAAAAAAI9BVWWNNZ+UJLnvvntz3333pqoefPwXd9yec//1zOz80pd1FREAmGUGbQAAAAAAAOAxWrZsWV790hfmxVtvnh1+b6c8d5vFDz72rZO/nue/YMc8aa21O0wIAMwmgzYAAAAAAADwGK2yyir50sln59Rzv59LLjwvl/3wBw8+duLxx2aPvV7VYToAYLYZtAEAAAAAAICVtPY66+Z3fvf38p0zTkuS3HLzTbnkwvPyop1f2nEyAGA2GbQBAAAAAACAx+Dmm27M7bfdmiS5+6678t0zz8hmz9w8SXLq14/Pi3Z9aX5jtdU6TAgAzLZ5XQcAAAAAAACAx6Mbb/h5/uu7DsqyZcty//0tL/39vbPjrrsnSU464dj80dvf1XFCAGC2GbQBAAAAAACAx+A3n71ljj7prOU+9tkvfX3MaQCAcXDpKAAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGMIKB22qarWqOreqLqqq71fVRwbbN6uqc6rq8qr6YlWtOvq4AAAAAAAAAADQjWHOaPOrJDu31n47ydZJdq+q7ZN8PMnfttaemeSWJAeMLCUAAAAAAAAAAHRshYM2bcovBnefOPivJdk5yTGD7Ycn2XsUAQEAAAAAAAAAYC4Y5ow2qapVqurCJDckOTXJj5Pc2lq7b/CUq5NsPJKEAAAAAAAAAAAwBww1aNNaW9Za2zrJoiTbJXnWsD+gqg6sqiVVtWTp0qWPLSUAAAAAAAAAAHRsqEGbB7TWbk1
yepIdkqxbVfMGDy1Kcs0033Noa21xa23x/PnzVyYrAAAAAAAAAAB0ZoWDNlU1v6rWHdxePclLklyaqYGbfQdP2z/J8SPKCAAAAAAAAAAAnZu34qdkYZLDq2qVTA3mHN1a+1pV/SDJUVX135NckOSwEeYEAAAAAAAAAIBOrXDQprX270m2Wc72K5JsN4pQAAAAAAAAAAAw16zw0lEAAAAAAAAAAIBBGwAAAAAAAAAAGIpBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGMIKB22q6qlVdXpV/aCqvl9V7xhsX6+qTq2qywZfnzz6uAAAAAAAAAAA0I1hzmhzX5L3tNaek2T7JAdX1XOSHJLktNba5klOG9wHAAAAAAAAAICJtMJBm9bada218we370hyaZKNk+yV5PDB0w5PsveIMgIAAAAAAAAAQOeGOaPNg6pq0yTbJDknyYLW2nWDh36eZMHsRgMAAAAAAAAAgLlj6EGbqnpSkmOTvLO1dvtDH2uttSRtmu87sKqWVNWSpUuXrlRYAAAAAAAAAADoylCDNlX1xEwN2RzRWjtusPn6qlo4eHxhkhuW972ttUNba4tba4vnz58/G5kBAAAAAAAAAGDsVjhoU1WV5LAkl7bW/uYhD52QZP/B7f2THD/78QAAAAAAAAAAYG6YN8RzXpDkjUkurqoLB9s+mORjSY6uqgOSXJXkNSNJCAAAAAAAAAAAc8AKB21aa2cnqWke3mV24wAAAAAAAAAAwNy0wktHAQAAAAAAAAAABm0AAAAAAAAAAGAoBm0AAAAAAAAAAGAIBm0AAAAAAAAAAGAIBm0AAAAAAAAAAGAI87oOAIzPxVff1nWEkdlq0TpdRwAAAAAAAGCMHPsCuuCMNgAAAAAAAAAAMASDNgAAAAAAAAAAMASDNgAAAAAAAAAAMIR5XQcAHv/+/D0H59unnZz11p+fL5/23STJnx305lx5xWVJkjtuvy1rrb1OvnTy2V3GBAAAAAAAgBVy7AuYiUEbYKW94tWvz+ve9JZ86J0HPbjtrz71uQdv//VHP5Qnrb12F9EAAAAAAADgUXHsC5iJS0cBK23x9i/IOus+ebmPtdZy8te+kj322nfMqQAAAAAAAODRc+wLmIlBG2CkzjvnX7P+BvPztM2e0XUUAAAAAAAAWCmOfQEGbYCROvH4Y7PHXq/qOgYAAAAAAACsNMe+gHldBwAm13333ZfTTvpqjvrGGV1HAQAAAAAAgJXi2BeQGLQBRuh7Z52RzZ6xeZ6ycOOuowAAAAAAAD1w8dW3dR1hpLZatE7XEXrNsS8gcekoYBa87+AD8sa9d8tVV1yWXX/nOTnuqH9Okpx0wrHZY699O04HAAAAAAAAw3PsC5hJtdbG9sMWL17clixZMrafBzzcJE9xm+AGAAAAAGA6Ph/vj0le68R6P9Ikr/e0a33tBeMNMm4bbdN1AkiSVNV5rbXFy3vMGW0AAAAAAAAAAGAIBm0AAAAAAAAAVuDP33Nwdtz6mXnlLjv82mOHf/rv89ynrptbbr6pg2QAjJNBGwAAAAAAAIAVeMWrX59P/csxv7b959dene+eeXoWbryog1QAjJtBGwAAAAAAAIAVWLz9C7LOuk/+te3/6yMfzLs+9JFUVQepABg3gzYAAAAAAAAAj8HpJ389Gz5
lYbZ4zlZdRwFgTOZ1HQAAAAAAAADg8eauu+7MP/7fv8mnjziu6ygAjJEz2gAAAAAAAAA8Sj+78ie55mdX5dUvfWF232GrXH/dtXntHjvmxhuu7zoaACO0wjPaVNVnk7w8yQ2ttS0H29ZL8sUkmya5MslrWmu3jC4mAAAAAAAAwNzxm8/+rXz7wssfvL/7DlvlyK+fkSevt36HqQAYtWHOaPNPSXZ/xLZDkpzWWts8yWmD+wAAAAAAAAAT6X0HH5A37r1brrrisuz6O8/JcUf9c9eRAOjACs9o01o7s6o2fcTmvZK8eHD78CRnJHn/bAYDAAAAAAAAmCv+1ycPm/Hxk7578ZiSANClFQ7aTGNBa+26we2fJ1kw3ROr6sAkBybJJpts8hh/HAAAAAAAMDLXXtB1gtHZaJuuE8wtk7zWifV+KGsNACMxzKWjZtRaa0naDI8f2lpb3FpbPH/+/JX9cQAAAAAAAAAA0InHOmhzfVUtTJLB1xtmLxIAAAAAAAAAAMw9j3XQ5oQk+w9u75/k+NmJAwAAAAAAAAAAc9O8FT2hqo5M8uIkG1TV1Un+IsnHkhxdVQckuSrJa0YZEgCA5OKrb+s6wshstWidriMAjJz9OADA3DLJ/VmiR4PeufaCrhOMzkbbdJ0A4GFWeEab1tp+rbWFrbUnttYWtdYOa63d1FrbpbW2eWtt19bazeMICwAAAAAAPH780bs/nA2fu0u23PnVD267+Zbb8pLXHZTNX7BXXvK6g3LLrbd3F5BZZb37w1oD0GeP9dJRAAAAAAAAM3rTa34/Jx3xfx+27WOf/Fx2eeF2uew7x2eXF26Xj33ycx2lY7ZZ7/6w1gD02QovHQUAMOdM8mlQE6dCpb8m+bXtdQ0AQE+9aPttc+XPrn3YtuNP/nbOOObQJMn+r355Xrzvgfn4h97RRTxmmfXuD2sNjNskX/LR5R4ff5zRBgCAh/n8YZ/KK3fZIa/cZfv8y2f+oes4ADxK9uMAwFx3/Y03ZeGC+UmSp2y4Qa6/8aaOE41en3u0Pq53X1nrfvjEZ76QLXd+dX5rp33zd/94RNdxADph0AYAgAdd9sMf5Ngv/HO+8LXT8qWTz86Zp52cn/7kiq5jATAk+3EA4PGmqlJVXccYKT3af+rDejPFWk+mS354ef7xC1/OuV//51x06lH52jfPyuU/+WnXsQDGzqWjem6ST7GVOM0WADxaP7n8P/LcbbbN6quvkSRZ/PwX5JsnfTV/dFBHp/md5EsJJdNeTkiPBjxWc24/DgCPQ/rx0Vuwwfq57vqlWbhgfq67fmk2XH+9riONVN97tL6td59Z68l36WU/yfO32TJrrL56kmTH7bfNcSd+K+97+5u6DQYwZs5oAwAwjb899PP5rZ32zZY7vzr7vf0DufvuX3UdaeSeucWzc/65382tt9ycu+66M2edfmquv/bqrmPBrOrja5v+sB8HAB4PXrHbi3L4l76WJDn8S1/LXi/dseNEo9X3Hq1v691n1nrybfmsZ+Sscy7ITTffmjvvuivf+NbZ+dm113cdC2bVv/zjJ/PKXbbPK3fZIe87+ID86u67u47EHGTQBgBgOa657ob8n88elSXf+Hwu+daXsmzZ/Tnq+JO7jjVyT998i7z57e/IW9/wyhz0B6/KFs/ZKk9YZZWuY8Gs6etrm/6wHwcA5pr93v6B7PCKN+VHP74qi7bdPYcd+ZUccvCbc+qZ38vmL9gr3zzrnBxy8Ju7jjlSferRrHd/WOt+evbmT8/7D35Tdnv927P7G/4kW//WFlnlCQ43Mzmuv+7aHPG5T+fIr52eL5/23dx//7KcdMKxXcdiDnLpKACAadx337Lcdfev8sQnzsudd92VjZ4yv+tIY7HP6/4w+7zuD5Mkn/jYR7Ng4UYdJ4LZ1dfXNv1hPw4AzCVH/sP/XO72047+9JiTdKsvPZr17g9r3V8H7Ld3Dthv7yTJB//n32fRwgXdBoJZtuy+ZfnV3Xdn3hO
fmLvvuivzFyzsOhJzkEEbgAk0ydcSn/E64tdeML4g47bRNsvdPMlrnXR73fiNF26Y977tjdlkuz2z+mq/kd123CG77bhDZ3nG6aYbl2b9Debnumt+ltNO+mo+f/ypXUeCWTPXXtu93Y9P8nt2Mu379rjYj3evt69t+sN+HOBR06MBk+KGG2/Ohhusl59ec12OO/H0fO+rh3cdCWbNgoUbZf+3/kl2237LrLbaatnhRTvnd3fcuetYzEEGbRiLn/z4srzv7f95isCrf3pV3v6eD+SNf/z2DlMBwPRuufX2HH/yGfnJ976Wddd+Ul791vfn88d+PX/wqpd1HW3k3n3gH+a2W2/OvHnz8sH//tdZe511u440cj+6/Mq89qBDHrx/xU+vyUff+7a88y1v6DDVaPW1P+vza5v+6ON+HAAeb/raj/eZHg2YFK96y3tz0y235Ynz5uWT/+P9WXedtbqOBLPm9ltvzemnfCMn/utFWWvtdfLet+2frx33xbx8n9d2HY05xqANY7HZMzbPl04+O0mybNmy7Po7z84uu7+841QAML1vnnVONttk48xf/8lJkn322Dn/uuTfe3Ew/vDjTuw6wtht8cxNc+GpRyWZ6lU23nb3vHKPnTpONVp97c/6/NqmP/q4HweAx5u+9uN9pkcDJsVZX/5s1xFgZL539hlZ9NSnZb31N0iS7LLH7+fCJecatOHXGLRh7M45+9t56tM2y0aLNukuxCSf4tjpjR9uktc6sd4wQpts/JR87/yLc+ddd2X11VbLaWefm8W//ZyuYzEGp519bp7xtEV52qKNuo4yNnOiPxsTr20YM/14v0zyeruUK5ns9Xa5x271qR8HAJjLnrLxovz7BUty1113ZrXVVs853/l2fuu5c6NnZG55QtcB6J+TTjg2e+z1qq5jALPg9ttuzbvf+od5xYt/J3vttF0uOu/criONxa233ZF93/JnedaL9smzd9wn311yUdeRxqJv6/38522VfV+2S5730jdkq11ek/vvvz8HvmGfrmMxBkcdf3L22/ulXccYqz71Z31+bfdtP5709z2b/ujj65p+sR+nL/rUjwMAzGXP3WZxdt3zFXntHjtmn11/N+3++7Pv69/UdSzmIGe0YazuveeenHHqiXnHIX/RdRRgFnz8w4fkBS/eNX/z6X/Ovffck7vuurPrSGPxjj//q+y+0+/mmH/8q9xzz7258667u440Fn1c74+896B85L0HdR2DMbrnnntzwiln5n9+4E+7jjI2fezP+vra7uN+vK/v2fRHH1/X9Iv9OH3Qx34cAGAuO/g9H8zB7/lg1zGY45zRhrE6+/RT8+wtfzvrz9+w6yjASrrj9tty3jn/mn1e98YkyRNXXTVrr7Nut6HG4Lbb78iZ55yfA/bbO0my6qpPzLrrrNVtqDHo63rTPyee/p08b6tnZcH89buOMjb6s37o4368r+/Z9EcfX9f0i/04faEfBwCAxx9ntGGsTjy+X6dB3fT5L8taT1ozqzzhCZk3b5UsOfGIriMxQn1b72t+dlXWW2+D/Ld3vz3/ceklefZWW+f9H/lY1lhjza6jjdRPfnpt5q//5Lz5XR/ORT/4j2z73GfnEx/9s6y5xupdRxupvq43/XPkV07q3WWj+taf9VUf9+N9fc/uM/345L+uH9C3tX7A7jtslTXWXCurrPKErLLKvBz1jTO6jjRSfd6P922tH9DX17Z+HAAAHn+c0YaxufPOX+a7Z52eXfb4/a6jjNXpX/p0Ljz1qN58ONB3fVrvZfcty6WXXJTX/OEBOfqks7L6Gmvks5/8265jjdx9y5bl/It/mIP+cN9ccMqRWXON1fOx//u5rmONXF/Xm3755Z135dQzz8k+e+zcdZSx6Wt/1kd93I/39T277/Tjk/26fqg+rfVDHXb0V/Olk8/uxeBF3/fjfVrrh+rba1s/DgAAj08GbRibNdZYM2dd/JOstfY6XUcBZsGChRtlwcKN8txtFid
JXrLnXrn0kn/vONXoLVq4YRYt3DDPf95WSZJ9X7ZLzr/4hx2nGr2+rjf9suYaq+em75+eddbuzyUJ9Gf90cf9eF/fs+mPPr6u6Rf7cfpAPw4AAI9PBm1ghKoqu+13cLbd/fU59PPHdh2HEevbem+w4YIsWLgoP/nxZUmSc77z7Tx98y06TjV6T9lwgzx1owX50eVXJklOO/vcPOc3N+s21Bj0db0BJkUf9+N9fc/uM/345L+uH9C3tX5QVd76hlfmtXvumGOO+Keu04xcr/fjPVvrB/T2tQ0AADzuzOs6AEyys7/82Wy8cMPccOPNecnrDsqznrlpXrT9tl3HYkT6uN4f+MuP5wN/+pbce+89WbTJpvnL//0PXUcai7//y/fnDX/6odxz7715+iaL8rm/+XDXkcair+sNMCn6uB/v63t2X+nH+/G6Tvq51kly+LEnZcHCjXLTjUvz1tfvnU2fsXkWb/+CrmONVF/3431c66S/r20AAODxx6ANjNDGCzdMkmy4wXp55R475dwLv+8DggnWx/V+1m89t3fXi0+SrbfcojfXi3+ovq43wKTo4368r+/ZfaUf748+rnUydbmwJFl/g/nZefeX55ILz5/44Yu+7sf7uNZJf1/bAADA449LR8GI/PLOu3LHL3754O1Tvv29bLnFMzpOxahYbwAA6I5+vD/6utZ33vnL/PIXdzx4+7tnnp5nbvHsjlMxCn1d676+tgEAgMcnZ7SBEbl+6U155QHvSZLct2xZXr/37tl9p8n/66O+st4AANAd/Xh/9HWtb166NO98yxuSJMuWLcsee+2bF+60a8epGIW+rnVfX9sAAMDjk0EbGJGnP21RLvrmF7uOwZhYbwAA6I5+vD/6utaLnrZpjjnlO13HYAz6utZ9fW0DAACPTyt16aiq2r2qflRVl1fVIbMVCgAAAAAAAAAA5prHPGhTVask+WSSPZI8J8l+VfWc2QoGAAAAAAAAAABzycqc0Wa7JJe31q5ord2T5Kgke81OLAAAAAAAAAAAmFtWZtBm4yQ/e8j9qwfbAAAAAAAAAABg4swb9Q+oqgOTHDi4+4uq+tGofyZz2gZJbuw6RAf6WHcfa07U3Sd9rDlRd5/0seZE3X3Sx5oTdfdJH2tO1N0nfaw5UXff9LHuPtacqLtP+lhzou4+6WPNibr7pI81J+qmv5423QMrM2hzTZKnPuT+osG2h2mtHZrk0JX4OUyQqlrSWlvcdY5x62Pdfaw5UXfXOcapjzUn6u46xzj1seZE3V3nGKc+1pyou+sc49THmhN1d51jnPpYc6LurnOMWx/r7mPNibq7zjFOfaw5UXfXOcapjzUn6u46xzj1seZE3V3nYG5amUtH/VuSzatqs6paNcnrkpwwO7EAAAAAAAAAAGBuecxntGmt3VdVf5Lk5CSrJPlsa+37s5YMAAAAAAAAAADmkJW5dFRaa99I8o1ZykI/9PUyYn2su481J+rukz7WnKi7T/pYc6LuPuljzYm6+6SPNSfq7pM+1pyou2/6WHcfa07U3Sd9rDlRd5/0seZE3X3Sx5oTdcOvqdZa1xkAAAAAAAAAAGDOe0LXAQAAAAAAAAAA4PHAoA1jU1W7V9WPquryqjqk6zzjUFWfraobquqSrrOMS1U9tapOr6ofVNX3q+odXWcah6pararOraqLBnV/pOtM41JVq1TVBVX1ta6zjEtVXVlVF1fVhVW1pOs841JV61bVMVX1w6q6tKp26DrTKFXVFoM1fuC/26vqnV3nGoeqetdgX3ZJVR1ZVat1nWkcquodg5q/P8lrvbz+pKrWq6pTq+qywdcnd5lxtk1T86sHa31/VS3uMt+oTFP3Xw324/9eVV+uqnU7jDgS09T9l4OaL6yqU6pqoy4zzraZfu+oqvdUVauqDbrINkrTrPWHq+qah7x/79llxlGYbr2r6k8Hr+/vV9X/6irfKEyz1l98yDpfWVUXdhhxJKape+uq+t4Dv4tU1XZdZhy
Faer+7ar67uD3sK9W1dpdZpxt032W0oMebbq6J7ZPm6Hmie7RZqh70nu0GT8nncQ+bYa1nugebaa1nvAebbr1ntg+bYaaJ7pHm6HuSe/Rlnvcp6o2q6pzaup45xeratWus86WGWr+k0G9E/W+9YAZ6j6ipo5tX1JTv6c8seuszB0uHcVYVNUqSf4jyUuSXJ3k35Ls11r7QafBRqyqXpTkF0n+ubW2Zdd5xqGqFiZZ2Fo7v6rWSnJekr17sNaVZM3W2i8Gb7RnJ3lHa+17HUcbuap6d5LFSdZurb286zzjUFVXJlncWrux6yzjVFWHJzmrtfaZwS8Pa7TWbu041lgM3seuSfL81tpVXecZparaOFP7sOe01u6qqqOTfKO19k/dJhutqtoyyVFJtktyT5KTkryttXZ5p8FGYHn9yeDDvptbax+rqYHoJ7fW3t9lztk0Tc3PTnJ/kk8neW9rbeIGJ6epe7ck32qt3VdVH0+SSVrrZNq6126t3T64/V8ytY97W4cxZ9V0v3dU1VOTfCbJs5JsO2m9yzRr/eEkv2it/XWX2UZpmrp3SvKhJC9rrf2qqjZsrd3QZc7ZtKLfravqfye5rbX20bGHG6Fp1vqUJH/bWjtxcJDyfa21F3cYc9ZNU/e/Zer9+ttV9UdJNmut/bcuc86m6T5LSfKmTHaPNl3dLRPap81Q86JMcI82Q91XT3iPNu3npJPap82w1q/JBPdoM9S9IJPdo63wWMCk9WkzrPXfZYJ7tBnqPjyT3aMt97hPkncnOa61dlRV/b8kF7XWPtVl1tkyQ82/SnJLkjMygcdGZqh7vSQnDp72hSRnTspas/Kc0YZx2S7J5a21K1pr92TqQNZeHWcaudbamUlu7jrHOLXWrmutnT+4fUeSS5Ns3G2q0WtTfjG4+8TBfxM/yVhVi5K8LFMfDDDBqmqdJC9KcliStNbu6cuQzcAuSX486UM2DzEvyepVNS/JGkmu7TjPODw7yTmttTtba/cl+XaSfTrONBLT9Cd7ZerDkQy+7j3OTKO2vJpba5e21n7UUaSxmKbuUwb/xpPke5k6qDNRpqn79ofcXTMT1qfN8HvH3yZ5Xyas3gf08fetZNq6D0rysdbarwbPmZgDOMnMaz34QPQ1SY4ca6gxmKbuluSBvxReJxPYp01T928mOXNw+9QkrxprqBGb4bOUSe/Rllv3JPdpM9Q80T3aDHVPeo820+ekE9mn9fiz4enqnvQebcb1nsQ+bYaaJ7pHm6HuSe/Rpjvus3OSYwbbJ6pHm67m1toFrbUru0s2WjPU/Y3BYy3JuZmwHo2VY9CGcdk4yc8ecv/q9KDB7ruq2jTJNknO6TjKWNTUJZQuTHJDklNba32o++8y9aHA/R3nGLeW5JSqOq+qDuw6zJhslmRpks/V1KXCPlNVa3Ydaoxelwn6UGAmrbVrkvx1kp8muS5Tf3V0SrepxuKSJL9XVetX1RpJ9kzy1I4zjdOC1tp1g9s/z9Rf3TH5/ij/+Vc5E6+q/kdV/SzJG5L8edd5Rq2q9kpyTWvtoq6zdOBPauoyFJ+tCbvMygx+M1PvY+dU1ber6ne6DjRGv5fk+tbaZV0HGZN3Jvmrwf7sr5N8oNs4Y/P9/OcfbL06E9ynPeKzlN70aH37DCmZseaJ7tEeWXdferSH1t2XPm05/8Z70aM9ou7e9GjT7NMmuk97RM3vTE96tEfUPfE92iOP+yT5cZJbHzIgO3HHO3t6rGvGugdnuXljps6CDkkM2gAjUlVPSnJsknc+4q9TJlZrbVlrbetMTbRuN7gMycSqqpcnuaG1dl7XWTrwwtba85LskeTgwanNJ928JM9L8qnW2jZJfpnkkG4jjUdNXSbrFUm+1HWWcRh82LVXpoarNkqyZlX9QbepRq+1dmmSjyc5JVO/MF2YZFmXmboy+AuNifqrSn5dVX0oyX1Jjug6y7i01j7UWntqpmr+k67zjNJgYPCDmeCDVTP4VJJnJNk6UwO
j/7vTNOMzL1OntN4+yZ8lOXrwF8R9sF96MhA9cFCSdw32Z+/K4IyTPfBHSd5eVeclWStTl/qcODN9ljLJPVofP0OaruZJ79GWV3cferSH1p2p9Z34Pm05a92LHm05dfeiR5thPz6xfdpyau5Fj7acuie+R3vkcZ9MXfJvovXtWNcDVlD3P2TqslFndRKOOcmgDeNyTR4+ybposI0JNJjsPDbJEa2147rOM25t6nI6pyfZveMoo/aCJK+oqiszdTm4navq891GGo/BGT8eON3rlzPVYE+6qzN17fQHpriPydTgTR/skeT81tr1XQcZk12T/KS1trS1dm+S45L8bseZxqK1dlhrbdvW2osydc3h/+g60xhdP7je9gPX3Z6o01nzcFX1piQvT/KGwUG7vjkiE3Y66+V4RqYGJi8a9GqLkpxfVU/pNNUYtNauH3w4dn+Sf0w/+rRkqlc7bnBG63MzdcbJDTrONHKDy1zuk+SLXWcZo/0z1Z8lU4Pgvfg33lr7YWttt9batpk6YPfjrjPNtmk+S5n4Hq2PnyFNV/Ok92hDrPVE9mjLqXvi+7TlrXUferRp/o1PfI82wz5tYvu0aWqe+B5tmtf2xPdoD3jIcZ8dkqw7+DeeTPDxzh4d63qYR9ZdVX+RZH6Sd3cYiznIoA3j8m9JNq+qzQZnBnhdkhM6zsQIDCbyD0tyaWvtb7rOMy5VNb+q1h3cXj3JS5L8sNNQI9Za+0BrbVFrbdNMvaa/1Vqb+LNeVNWaVbXWA7eT7JapS85MtNbaz5P8rKq2GGzaJckPOow0ThP71zfT+GmS7atqjcE+fZdMXXd54lXVhoOvm2Tqw6AvdJtorE7I1IdCGXw9vsMsjFBV7Z6pyz6+orV2Z9d5xqWqNn/I3b0y+X3axa21DVtrmw56tauTPG/wfj7RHjggPfDK9KBPG/hKkp2SpKp+M8mqSW7sMtCY7Jrkh621q7sOMkbXJtlxcHvnJBN5KYZHekif9oQk/zXJ/+s20eya4bOUie7R+vgZ0nQ1T3qPNkPdE92jLa/uSe/TZljrie7RZtiffSUT3KOtYD8+kX3aDDVPdI82w2t70nu05R33uTRTQxj7Dp42UT1aH491JdPXXVV/nOSlSfYbDIvCg2oCh+OZo6pqzyR/l2SVJJ9trf2PbhONXlUdmeTFmZpSvz7JX7TWJvKUgQ+oqhcmOSvJxZma0E+SD7bWvtFdqtGrqucmOTxT/76fkOTo1tpHu001PlX14iTvba29vOMoI1dVT8/UWWySqdO/fqEP+7Mkqaqtk3wmUx8KXJHkza21WzoNNWKDYaqfJnl6a+22rvOMS1V9JMlrM3VK6wuS/HFr7Vfdphq9qjoryfpJ7k3y7tbaaR1HGonl9SeZ+vDv6CSbJLkqyWtaazd3FHHWTVPzzUn+PlN/kXJrkgtbay/tKOJITFP3B5L8RpKbBk/7XmvtbZ0EHJFp6t4zyRaZ6k+vSvK2B85QNwlW9HvH4K+lF7fWJuZD/WTatX5xpi5J0JJcmeStrbXrOgk4ItPU/S9JPpup2u/JVG/+rY4izrrp/o1X1T9laj82UR/oP2Catf5Rkk9k6neRu5O8vU3Y5XynqftJSQ4ePOW4JB+YpDN+TPdZSpJzMtk92nR1/0YmtE+boeb/kwnu0Wao+4BMdo+2ws9JJ61Pm2Gt98sE92gz1P3NTHaPNu2/8Unt02ZY69szwT3aDHVvnsnu0ZZ73GdwjOCoTF0a7oIkfzApn53OUPN/ydRQ8FMydZbFb7TW/ri7pLNrhrrvy1SPcsfgqcf16dgfMzNoAwAAAAAAAAAAQ3DpKAAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGIJBGwAAAAAAAAAAGML/BxO0F2/Sbop/AAAAAElFTkSuQmCC\n", "text/plain": 
[ "<Figure size 2880x360 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 18.3 s, sys: 105 ms, total: 18.4 s\n", "Wall time: 18.3 s\n" ] } ], "source": [ "%%time\n", "import operator\n", "from collections import Counter\n", "from itertools import compress\n", "\n", "import hdbscan\n", "import matplotlib.colors as mcolors\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "\n", "# HDBSCAN clustering on the document embeddings\n", "print('【HDBSCAN】Clustering ...',end='')\n", "hclusterer = hdbscan.HDBSCAN(prediction_data=True).fit(embeddings) #embeddings_list\n", "print('DONE!\\n','-'*80,'\\nNoise ratio:',round(list(hclusterer.labels_).count(-1) / len(embeddings),3)*100,'% ',list(hclusterer.labels_).count(-1),'/',len(embeddings))\n", "\n", "\n", "# Use approximate_predict to find the cluster of the target domain\n", "# (change the filter below to predict a different document)\n", "predict_doc = sample(data[data['news_content'].str.contains('幸福空間')]['news_content'].tolist(), 1)[0].replace('\\n','')\n", "print('...',predict_doc[50:350], '...\\n','-'*80)\n", "test_labels, strengths = hdbscan.approximate_predict(hclusterer, model.encode([predict_doc]))\n", "target_domain_cluster = test_labels[0]\n", "print('Predict_doc (target domain) is predicted to be in cluster #',target_domain_cluster,end='')\n", "# If the document is classified as noise (-1), fall back to the largest cluster\n", "if target_domain_cluster == -1:\n", "    temp = Counter(hclusterer.labels_)\n", "    del temp[-1]\n", "    target_domain_cluster = max(temp.items(), key=operator.itemgetter(1))[0]\n", "    print(' (noise)\\n -> Replace with the largest cluster #',target_domain_cluster)\n", "\n", "\n", "\n", "\n", "# Bar chart of cluster sizes (noise excluded)\n", "fig, ax = plt.subplots(figsize=(40, 5), dpi=72)\n", "labels, values = zip(*sorted(Counter(hclusterer.labels_[hclusterer.labels_!=-1]).items()))\n", "indexes = np.arange(len(labels))\n", "width = 0.9\n", "bars = ax.bar(indexes, values, width, alpha=0.2, 
color = [[color for name, color in mcolors.TABLEAU_COLORS.items()][i%2] for i in sorted(Counter(hclusterer.labels_[hclusterer.labels_!=-1]))])\n", "plt.xticks(indexes, labels)\n", "ax.bar_label(bars, fmt='%d', padding=-14)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 9, "id": "e40ae708", "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "num of news in the cluster # 31 : 77\n", "news in the cluster # 31 :\n", "\n", "... 主打29坪3房,超強坪效掀起網路話題,網友好奇如何規劃多房格局;專家指出,竹圍街的大樓普遍超過20年,平均單價34.3萬元,周邊新案稀少,新案開價每坪45~58萬元,該案的價格合理,尤其區域商家林立,還被綠地公園環抱,地段良好能彌補狹小空間的的特點,仍可吸引小資族青睞。\n", "\n", "民眾上網分享後竹圍街某建案廣告,廣告寫明29坪規劃3房,還主打文教公園宅,售屋平台更以小家庭首選當作賣點,引發作者質疑「真的適合家庭嗎?」文章吸引網友矚目,多數觀點直指空間狹小「不太適合成家」,亦有緩頰者認為「沒入住前不宜妄下評論」,網路看法不一。\n", "\n", "《蘋果》對比發現,建案基地位於竹圍街198號旁,面積138.24坪,規劃地 ...\n", " --------------------------------------------------------------------------------\n", "... 人口的改變,小宅已成常態,日前有網友在《買房知識家(A你的Q)》發問,對於兩大一小的家庭成員來說,住室內坪數18~20坪,能隔成3房嗎?原po認為,現在房價這麼貴公設比這麼高,能買到這樣的坪數已經不簡單,但這樣的空間住久了會不會後悔?要是有長輩來住就真的塞不下了。\n", "文章曝光後,釣出同樣買不起大房的人共鳴,認為20坪想規劃3房還是做得到,頂多每個使用空間變小而已,先求有再求好比較重要,「一定有一間房會比較小,當客房還行」、「收納跟裝潢要很吃重 小孩房要用床架才方便」、「看是重視客廳還是房間,這個坪數硬要三房,空間必須犧牲一個」、「偶爾長輩來的時候,就睡小孩房,小孩跟你們睡不用為了偶爾的客人多花幾 ...\n", " --------------------------------------------------------------------------------\n", "... 可能利用室內空間,會將小坪數空間分隔成2至3房,但卻可能導致客廳的部分沒有採光。就有網友在臉書《買房知識家(Q你的A)》發文詢問,「客廳無採光的房子,建議買嗎?」\n", "貼文曝光後,許多網友點出客廳無採光缺點,甚至可能影響轉手價格,「明廳暗房,古有明訓」、「客廳沒有通風採光,感覺幽閉感很重」、「通常明廳暗房,但現代人明廳明房,這叫採光好。客廳無採光如不介意可以自住,但往後要賣不管買方介不介意(不介意也會說不好)都是議價題材,請三思」、「住過這種房子,如果又是朝北的,真的是,慘」、「暗廳嗎?不考慮哦!客廳明亮很重要捏!你可以搜一下,通常比較難賣的房子,除了樓層或屋齡屋況等基本的條件不佳之外,還有一個就 ...\n", " --------------------------------------------------------------------------------\n", "... 會找很多設計師進行比價,不過,每個設計師的設計風格、用材、工期等都不同,到底該如何正確計算,才能找出符合自己預算和品質的設計師呢?\n", "翔翰室內裝修設計總監盧淑媛表示,其實比價是最具爭議的事情,因為牽扯範圍很廣,不管是材料的內容、品質施工方式、粗細度、收法等等都不同,而且有些材料看似相同,但很多卻是從東南亞進口的,會有很大的價差,所以比價並沒有一個基準點。\n", "?首購必看!買房「隱性成本」明細大公開!\n", "?雙北買氣最旺捷運站出爐!「他」第一!\n", "? 
首購要買哪?萬名網友狂推「這區」\n", "?1字頭熱門宅都在這!2021恐成絶響\n", "盧淑媛分享,好的設計師能依消費者習慣及想法完成裝修,能帶著愉悅的心情入住。示意圖/p ...\n", " --------------------------------------------------------------------------------\n", "... 民眾在《買房知識家》發文表示,最近看了一棟15樓高大樓,房子四周是水泥外,內部三房二衛全都採輕隔間,因此想詢問這樣的設計會不會有安全疑慮?買了會後悔嗎?\n", "文章曝光後,釣出過來人也對於輕隔間帶來的缺點感到相當困擾,「隔音超差,年輕夫妻晩上在幹麻,隔壁都知道」、「越高樓隔音越差」、「用輕隔間來隔房子內牆,售價也沒有比較便宜」、「新大樓還沒交屋就開始一堆裂,這是要怎麼自圓其說?猜想是因為在隔板交接處沒用上防裂膠帶或是防裂網,造成只要有小小地震就整面牆面佈滿細紋」,也有網友表示「之前牆是承重結構,一般用紅磚,隔音有比較好」、「輕隔間有不同的材質與工法,使得隔音防潮效果和在牆上掛重物的承受力,各有不同」 ...\n", " --------------------------------------------------------------------------------\n", "... 財力相對薄弱的首購族。觀察北中南2~3房主力推案區域,北台灣總價約落在1000~2000萬元,中部總價約在1000~1500萬元,南部多在千萬元內就可入手優質小宅大樓,不過專家提醒,首購熱區往往也有較多投資買盤,購屋時務必留意未來轉手性。\n", "\n", "觀察近年推案,2~3房的佔比極高,《591新建案》總編輯李忠哲表示,這兩年2、3房的推案,佔比高達8成,屬於市場主流,原因是建商以下修坪數、壓低總價的方式,利於推升銷售機會,鎖定的主力客層即為剛性需求的首購族群。\n", "\n", "甲桂林廣告業務總經理陳衍豪表示,2、3房因坪數較低,通常被視為換屋過渡期住宅,必須評估未來脫手難易度,比較推薦距離捷運1公里內、總價1500萬 ...\n", " --------------------------------------------------------------------------------\n", "... 型物件,建商近年來更是狂推小坪數、低總價建案。但低總價的同時,通常也伴隨著較高的單價,也讓民眾疑問中古的中大坪物件未來是否仍有補漲空間。專家表示,非豪宅的中大坪數物件相對下較不討喜,而小坪數物件預估幾年後單坪售價應該會再下跌。\n", "近日有網友在PTT詢問「中大坪數的未來趨勢」。該網友表示,目前小坪數當道,建商狂蓋權狀30-40坪的小三房,明明住起來很不舒服,因低總價仍舊很熱門。新房每坪40萬40坪權狀,明明再加一點預算,就有10年屋每坪25萬80坪權狀,雖然稍微折舊了點,但空間大超多,認為中小坪單價被市場炒到過高,而中大坪理論上應該要再補漲,但釋出的物件卻常常是賠售。\n", "㊙想買保值宅?買房選「這」最 ...\n", " --------------------------------------------------------------------------------\n", "... 不再那麼迫切。而目前養老服務體系建立的更加健全,年輕人更願意自己居住,而不是和父母輩住在一起。加上思想的轉變,頂客、不婚族的數量也日益龐大,因此小房子也將愈來愈搶手。專家表示,「小坪數房型」是當前的時勢所趨,未來這類型的房子不僅受歡迎,保值性更是無庸置疑。\n", "根據內政部資料顯示,雙北房市第二季交易坪數占比中,20~35坪(土地+建物)占36.9%,35~50坪占20.7%,新北市房市,20~35坪(土地+建物)占43.5%,35~50坪占23.5%,可見小坪數物件仍占市場大宗。因此專家分析,「小坪數房型」儼然已成為現今必備的產品與趨勢,而其中也存在一定優勢,才能穩住市場,吸引買家入手。\n", "?台北千 ...\n", " --------------------------------------------------------------------------------\n", "... 
悅目,往往別出心裁地想出各種新穎的設計來滿足自己的視覺享受,但其實有時候小資族為了節省預算,材質上卻變得不是那麼講究,導致用沒多久就毀損才來後悔,專家表示,有些裝潢設計並不耐用也不一定適合台灣氣候,尤其各地濕氣普遍均高,所以材質的選擇上必須特別注意。\n", "室內設計師盧淑媛表示,一般家庭最常用來布置牆壁的方式,不外乎是壁紙、壁布或油漆粉刷,但其實面對台灣濕氣重的環境,尤其是靠山、靠海或港口附近等濕氣更劇的地區,壁紙、壁布並不推薦,因空氣水分子含量多,較易造成黏貼處翹起,且容易有塵蟎甚至因長期潮濕引起壁癌導致壁紙發霉的慘狀。\n", "?台北千萬退休宅怎麼選?曝優劣分析\n", "?雙北買氣最旺捷運站出爐!「他」第一!\n", " ...\n", " --------------------------------------------------------------------------------\n", "... 起兩房,不少預售屋更是一釋出就被秒殺,反觀中大坪數的中古屋乏人問津。\n", "一名網友在PTT表示,目前小坪數當道,新建案幾乎都是權狀30至40坪的小三房,雖然他認同低總價的確很吸引人,但明明再多加一點預算就可以買到更大的中古屋,「只是稍微折舊了點,空間大很多」,為何大家寧願選擇新成屋?難道中大坪數被低估了嗎?\n", "?首購必看!買房「隱性成本」明細大公開!\n", "?雙北買氣最旺捷運站出爐!「他」第一!\n", "? 首購要買哪?萬名網友狂推「這區」\n", "?1字頭熱門宅都在這!2021恐成絶響\n", "低總價是小坪數房屋熱門的主要原因。示意圖/photoAC\n", "許多網友都持有相同看法,「大坪數主要是管理費貴跟大排氣量的車子二手價很慘類似 ...\n", " --------------------------------------------------------------------------------\n", "CPU times: user 4.26 ms, sys: 3.88 ms, total: 8.14 ms\n", "Wall time: 6.12 ms\n" ] } ], "source": [ "%%time\n", "\n", "from tqdm import tqdm\n", "from random import sample\n", "\n", "\n", "# check news in the target domain cluster or other cluster\n", "cluster_num = target_domain_cluster\n", "fil = [l==cluster_num for l in hclusterer.labels_]\n", "doc_list = list(compress(data['news_content'].tolist(), fil))\n", "\n", "print('num of news in the cluster #',cluster_num,':', len(doc_list))\n", "print('news in the cluster #',cluster_num,':\\n')\n", "for d in sample(doc_list,min(10,len(doc_list))):\n", " print('...',d[50:350], '...\\n','-'*80)" ] }, { "cell_type": "code", "execution_count": 10, "id": "18b801b7", "metadata": {}, "outputs": [], "source": [ "def find_tags(doc_list, customized_stopwords = customized_stopwords):\n", " tag_list=[]\n", " pbar = tqdm(range(len(doc_list)))\n", " pbar.set_description(\"[Extracting keywords...]\")\n", "\n", " fail_count = 0\n", "\n", " for d in doc_list:\n", " pbar.update()\n", "\n", " try:\n", " result = obj.extractKeywordFromString(d) #, num_kw=6\n", " except:\n", " result = 'a'\n", " 
fail_count +=1\n", " print('-'*80)\n", " print('Keywords of this news are not available:\\n','...',d[50:250], '...\\n','-'*80)\n", "\n", "\n", " tags = list(filter(lambda x: len(x)>1 and x not in list(set(customized_stopwords)), result))\n", " tag_list.extend(tags)\n", " tag_list = list(set(tag_list))\n", "\n", " pbar.close()\n", " \n", " print('Num of keywords:', len(tag_list))\n", " print('Fail:', fail_count,'\\n')\n", " \n", " return tag_list" ] }, { "cell_type": "code", "execution_count": 11, "id": "3d384c2d", "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[Extracting keywords...]: 20%|██ | 2/10 [00:00<00:00, 12.25it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Num of news in the cluster # 30 #31 #32 : 102\n", "News in the cluster # 30 #31 #32 :\n", "\n", "... 讓售屋民眾猶豫的一大問題。近日有網友在PTT發文詢問空屋和裝潢屋,哪個比較好賣。若在房子出售前先裝潢過,是否真的會較容易賣出或影響出售行情。對此,專家表示,這點因人而異。\n", "許多網友表示是否會變比較好賣要看裝潢品味決定,「有些裝潢很個人風,你買到又要拆」、「看你裝潢後要加多少錢,和你國中美術課有沒有翹課」、「櫃子做太多的裝潢很不好」、「請不要亂用顏色,請只用黑白灰」、「裝潢最悲劇就是色系亂搭,穩死; ...\n", " --------------------------------------------------------------------------------\n", "... 年重劃區的發展日益茁壯,也有很多人會選擇重劃區做為看房買房的首選,最廣為受到討論的就有好幾個,包括桃園青埔、新店央北、板橋江翠北、新莊副都心以及頭前重劃區。有人在社團中問到,自己的預算約1200萬,想在頭前重劃區買不附停車位小坪數2房,不少網友紛紛回應原PO,這價格選擇很少,就算有,也會被秒殺。\n", "\n", "原PO在新莊地方社團問網友是否有推薦的房屋物件、推薦建案,預算1200萬想在頭前重劃區買小坪數2房, ...\n", " --------------------------------------------------------------------------------\n", "... 不再那麼迫切。而目前養老服務體系建立的更加健全,年輕人更願意自己居住,而不是和父母輩住在一起。加上思想的轉變,頂客、不婚族的數量也日益龐大,因此小房子也將愈來愈搶手。專家表示,「小坪數房型」是當前的時勢所趨,未來這類型的房子不僅受歡迎,保值性更是無庸置疑。\n", "根據內政部資料顯示,雙北房市第二季交易坪數占比中,20~35坪(土地+建物)占36.9%,35~50坪占20.7%,新北市房市,20~35坪( ...\n", " --------------------------------------------------------------------------------\n", "Find tags from renewhouse website ...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Extracting keywords...]: 100%|██████████| 10/10 [00:05<00:00, 1.76it/s]\n", "[Extracting keywords...]: 0%| | 0/1 [00:00