Data Retrieval - A Long Road

import html2text
from nltk.stem import PorterStemmer
import pymysql
from os import listdir
from os.path import isfile, isdir, join

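# Directory whose files we want to list
mypath = "D:/資料擷取/CACM_dataset/cacm/"

# Get the names of all files and subdirectories as a list
files = listdir(mypath)

# Collect the full path of every regular file
html_files = []
for f in files:
    # Build the full path of this entry
    fullpath = join(mypath, f)
    # Check whether fullpath is a file or a directory
    if isfile(fullpath):  # a file: remember it for later processing
        html_files.append(fullpath)
    # elif isdir(fullpath):  # a subdirectory: ignore it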


# file I/O: read one sample document (the file is closed automatically)
with open("cacm/CACM-0001.html", 'r', encoding='utf-8') as cacm_file:
    f_content = cacm_file.read()
print(f_content)  # original HTML

# Convert the HTML to plain (Markdown-style) text with html2text
h2t = html2text.HTML2Text()
f_content = h2t.handle(f_content)
print(f_content)

# tokenizing: split the text on whitespace (punctuation stays attached to the tokens)
f_content = f_content.split()
print(f_content)
print(len(f_content))  # number of tokens
print()
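Note that `split()` only breaks on whitespace, so punctuation stays attached to the tokens. A minimal sketch of a tokenizer that also drops punctuation, using only the standard `re` module. The function name `tokenize`, the lowercasing, and the choice to keep only ASCII letters and digits are assumptions for illustration, not part of the original script.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens, dropping punctuation."""
    # [a-z0-9]+ matches runs of letters and digits; anything else is a separator
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical sample sentence, for illustration only
tokens = tokenize("Hello, world! This is a test-case.")
print(tokens)
```

The resulting list can be passed to the stemmer in place of the output of `split()`.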

# stemming: reduce each token to its stem
stemmer = PorterStemmer()
for token in f_content:
    print(stemmer.stem(token))


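# The first block is the title
# Everything after it and before the "CACM" line is the abstract
# The "CACM" line gives the date
# Below that, check whether any authors are listed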
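The layout notes above could be turned into a small parser along these lines. This is only a sketch under several assumptions: that blocks in the converted text are separated by blank lines, that the date line starts with `CACM`, that authors are listed one per line in the block right after it, and that a block starting with `CA` plus six digits is a record-ID line rather than an author list. The function name `parse_cacm`, the dictionary keys, and the sample text are all invented for illustration.

```python
import re


def parse_cacm(text):
    """Split a CACM document (plain text) into title, abstract, date, authors."""
    # Assumption: blocks are separated by one or more blank lines
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    record = {"title": "", "abstract": "", "date": "", "authors": []}
    if not blocks:
        return record
    record["title"] = " ".join(blocks[0].split())
    # Assumption: the first later block starting with "CACM" is the date line
    date_idx = next(
        (i for i, b in enumerate(blocks) if i > 0 and b.startswith("CACM")), None
    )
    if date_idx is None:
        return record
    record["date"] = blocks[date_idx][len("CACM"):].strip()
    # Everything between the title and the date line is treated as the abstract
    record["abstract"] = " ".join(" ".join(blocks[1:date_idx]).split())
    # Assumption: authors, if present, follow the date line, one per line
    if date_idx + 1 < len(blocks):
        candidate = blocks[date_idx + 1]
        if not re.match(r"CA\d{6}", candidate):  # assumed record-ID line
            record["authors"] = [a.strip() for a in candidate.splitlines() if a.strip()]
    return record


# Invented sample text, for illustration only
sample = """An Example Title

This is the abstract. It may
span several lines.

CACM January, 1960

Doe, J.
Roe, R.
"""
record = parse_cacm(sample)
print(record)
```

Real files may deviate from this layout, so the output should be checked against a few documents before relying on it.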

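# Open the database connection (recent PyMySQL versions require keyword arguments)
db = pymysql.connect(host="localhost", user="root", password="", database="paper")

# Create a cursor object with cursor()
cursor = db.cursor()

# Run an SQL query with execute()
cursor.execute("SELECT VERSION()")

# Fetch a single row with fetchone()
data = cursor.fetchone()

print("Database version : %s " % data)

# Close the database connection
db.close()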
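Once the connection works, the parsed fields could be stored with a parameterized query. The sketch below only builds the statement and its parameters. The table name `papers`, its columns, the helper `build_insert`, and the sample record are all hypothetical and would need to match the real schema of the `paper` database.

```python
def build_insert(record):
    """Return (sql, params) for inserting one parsed document."""
    # %s placeholders let the driver escape the values; never format them in by hand
    sql = (
        "INSERT INTO papers (title, abstract, pub_date, authors) "
        "VALUES (%s, %s, %s, %s)"
    )
    params = (
        record["title"],
        record["abstract"],
        record["date"],
        "; ".join(record["authors"]),
    )
    return sql, params


# Invented record, for illustration only
sql, params = build_insert({
    "title": "An Example Title",
    "abstract": "A short abstract.",
    "date": "January, 1960",
    "authors": ["Doe, J.", "Roe, R."],
})
print(sql)
print(params)

# With an open PyMySQL connection `db`, this could then be run as:
#     with db.cursor() as cursor:
#         cursor.execute(sql, params)
#     db.commit()
```

Keeping the SQL text separate from the values avoids quoting problems with titles that contain apostrophes.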

Posted by KR on PIXNET (痞客邦)
