No such file or directory: './data/text_dict' #2

Open
GabrielLin opened this issue Apr 11, 2021 · 8 comments

Comments

@GabrielLin

When I run generate_dataset.py, the following error is shown. I am afraid there is nothing in the data folder. Could you help? Thanks.

Reading text vocabulary from './data/text_dict'...
Traceback (most recent call last):
  File "generate_dataset.py", line 301, in <module>
    dicts['text'] = initVocabulary('text', None, './data/text_dict', 50000, ' ', False)
  File "generate_dataset.py", line 283, in initVocabulary
    vocab.loadFile(vocabFile)  
  File "generate_dataset.py", line 73, in loadFile
    for line in open(filename):
FileNotFoundError: [Errno 2] No such file or directory: './data/text_dict'
@abhigupta768
Owner

You first need to build the text_dict, authors_dict, and pvs_dict files.

Change lines 300-312 in generate_dataset.py to the following:

dicts['text'] = initVocabulary('text', ['./data/abstract_train.p', './title_train.p'], None, 50000, ' ', False)
dicts['authors'] = initVocabulary('authors', './data/authors_train.p', None, 20000, ' ', False)

saveVocabulary('text', dicts['text'], './data/text_dict')
saveVocabulary('authors', dicts['authors'], './data/authors_dict')

pvs=p.load(open("./pv_train.p","rb"))
unique_pvs=np.unique(np.array(pvs))
dicts['pvs']=Dict()
for pv in unique_pvs:
    dicts['pvs'].add(pv)
dicts['pvs'] = dicts['pvs'].prune(300)
saveVocabulary('pvs', dicts['pvs'], './data/pvs_dict')

After you have built these files once, you can revert the code back to the original.
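
For reference, a minimal sketch of the reverted block once the dict files exist (the 'text' line is the one shown in your traceback; the other two are assumed to follow the same pattern):

dicts['text'] = initVocabulary('text', None, './data/text_dict', 50000, ' ', False)
dicts['authors'] = initVocabulary('authors', None, './data/authors_dict', 20000, ' ', False)
dicts['pvs'] = initVocabulary('pvs', None, './data/pvs_dict', 50000, ' ', False)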

Please let me know if you have any questions. Thanks!

@GabrielLin
Author

I followed your instructions and the script has been running for more than two days with no output. I will try running from the very beginning again. It might take a little time. Thanks.

@GabrielLin
Author

Sorry for the late reply. I have tried many times, but the modified script just keeps running and never finishes.

@abhigupta768
Owner

abhigupta768 commented May 17, 2021

Hey, sorry for the delayed reply. Can you provide some sample data that you are giving as input to the scripts?

If you can provide some sample data, I will look into it over the weekend and find the issue.

Thanks!

@GabrielLin
Author

GabrielLin commented May 18, 2021

Thank you. Here are the steps I followed.

My Readme

This repository contains code for the Modular-Hierarchical Attention Based Scholarly Venue Recommender System using Deep Learning

Tested on Ubuntu 16.04.4 LTS

Ref Repo

Updated to https://github.com/abhigupta768/publication-venue-recommender/tree/530702eb0552aafb8f8517b329579610e1a7aa81

Ref Paper

Pradhan, T., Gupta, A., & Pal, S. (2020). HASVRec: A modularized hierarchical attention-based scholarly venue recommender system. Knowledge-Based Systems, 204, 106181. doi:10.1016/j.knosys.2020.106181

Dependencies

Python Environment

conda create -c conda-forge -n py36gpvr python=3.6
conda activate py36gpvr

conda install -c pytorch pytorch==1.7.1 torchvision==0.8.2 cudatoolkit==10.2.89 cudnn==7.6.5

conda install -c conda-forge nltk==3.6.1
conda install -c conda-forge pandas==1.1.5


Data

Download AMiner-Paper.rar from https://lfs.aminer.cn/lab-datasets/aminerdataset/AMiner-Paper.rar;
I converted the above file into zip format and stored it elsewhere;

Unzip the archive at the project root

unzip AMiner-Paper.zip

Extract

python extract_data.py

Copy all .p files to the ./data folder

cp *.p data/

Generate the PyTorch-compatible dataset

nohup python -u generate_dataset_init.py > generate_dataset_init.log 2>&1 &
tailf generate_dataset_init.log

@GabrielLin
Author

In the step above, generate_dataset_init.py is the file I modified following your suggestions. Its content is:

# -*- coding: utf-8 -*-
"""pubrec-generate-dataset.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1yowkK19M7YGLZTbjX7g4dTEaQphvp0a_
"""

# Commented out IPython magic to ensure Python compatibility.
# The following two lines are commented by me
# from google.colab import drive
# drive.mount('/gdrive')
# %cd /gdrive

# Commented out IPython magic to ensure Python compatibility.
# %cd My\ Drive/pubrec

import torch
import torch.utils.data as torch_data

import time
import csv
import json as js
import os
import codecs
import pickle as p

import numpy as np

import nltk
nltk.download('punkt')
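# Note: nltk.download() needs network access; behind a restrictive proxy this
# call can hang silently (see the proxy issue mentioned later in this thread).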

PAD = 0
UNK = 1
BOS = 2
EOS = 3

PAD_WORD = '<blank>' 
UNK_WORD = 'UNK'
BOS_WORD = '<s>'
EOS_WORD = '</s>'
SPA_WORD = ' '

# Recursively flatten nested iterables (e.g. lists of index lists) into a flat stream of values.
def flatten(l):
    for el in l:
        if hasattr(el, "__iter__"):
            for sub in flatten(el):
                yield sub
        else:
            yield el

class Dict(object):
    def __init__(self, data=None, lower=False):
        self.idxToLabel = {}
        self.labelToIdx = {}
        self.frequencies = {}
        self.lower = lower
        # Special entries will not be pruned.
        self.special = [] 

        if data is not None:
            if type(data) == str:
                self.loadFile(data)
            else:
                self.addSpecials(data)

    def size(self):
        return len(self.idxToLabel)

    # Load entries from a file.
    def loadFile(self, filename):
        for line in open(filename):
            fields = line.split()
            label = ' '.join(fields[:-1])
            idx = int(fields[-1])
            self.add(label, idx)

    # Write entries to a file.
    def writeFile(self, filename):
        with open(filename, 'w') as file:
            for i in range(self.size()):
                label = self.idxToLabel[i]
                file.write('%s %d\n' % (label, i))

    def loadDict(self, idxToLabel):
        for i in range(len(idxToLabel)):
            label = idxToLabel[i]
            self.add(label, i)

    def lookup(self, key, default=None):
        key = key.lower() if self.lower else key
        try:
            return self.labelToIdx[key]
        except KeyError:
            return default

    def getLabel(self, idx, default=None):
        try:
            return self.idxToLabel[idx]
        except KeyError:
            return default

    # Mark this `label` and `idx` as special (i.e. will not be pruned).
    def addSpecial(self, label, idx=None):
        idx = self.add(label, idx)
        self.special += [idx]

    # Mark all labels in `labels` as specials (i.e. will not be pruned).
    def addSpecials(self, labels):
        for label in labels:
            self.addSpecial(label)

    # Add `label` in the dictionary. Use `idx` as its index if given.
    def add(self, label, idx=None):
        label = label.lower() if self.lower else label
        if idx is not None:
            self.idxToLabel[idx] = label
            self.labelToIdx[label] = idx
        else:
            if label in self.labelToIdx:
                idx = self.labelToIdx[label]
            else:
                idx = len(self.idxToLabel)
                self.idxToLabel[idx] = label
                self.labelToIdx[label] = idx

        if idx not in self.frequencies:
            self.frequencies[idx] = 1
        else:
            self.frequencies[idx] += 1

        return idx

    # Return a new dictionary with the `size` most frequent entries.
    def prune(self, size):
        if size >= self.size():
            return self

        # Only keep the `size` most frequent entries.
        freq = torch.Tensor(
                [self.frequencies[i] for i in range(len(self.frequencies))])
        _, idx = torch.sort(freq, 0, True)
        newDict = Dict()
        newDict.lower = self.lower

        # Add special entries in all cases.
        for i in self.special:
            newDict.addSpecial(self.idxToLabel[i])

        for i in idx[:size]:
            newDict.add(self.idxToLabel[i.item()])

        return newDict

    # Convert `labels` to indices. Use `unkWord` if not found.
    # Optionally insert `bosWord` at the beginning and `eosWord` at the end.
    def convertToIdx(self, labels, unkWord, bosWord=None, eosWord=None):
        vec = []

        if bosWord is not None:
            vec += [self.lookup(bosWord)]

        unk = self.lookup(unkWord)
        vec += [self.lookup(label, default=unk) for label in labels]

        if eosWord is not None:
            vec += [self.lookup(eosWord)]

        vec = [x for x in flatten(vec)]

        return torch.LongTensor(vec)

    # Convert `idx` to labels. If index `stop` is reached, convert it and return.
    def convertToLabels(self, idx, stop):
        labels = []

        for i in idx:
            if i == stop:
                break
            labels += [self.getLabel(i)]

        return labels

class AttrDict(dict):

    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

def read_config(path):
    return AttrDict(js.load(open(path, 'r')))


def format_time(t):
    return time.strftime("%Y-%m-%d-%H:%M:%S", t)


def logging(file):
    def write_log(s):
        print(s, end='')
        with open(file, 'a') as f:
            f.write(s)
    return write_log


def logging_csv(file):
    def write_csv(s):
        with open(file, 'a', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(s)
    return write_csv

class dataset(torch_data.Dataset):

    def __init__(self, text_data, label_data):
        self.text_data = text_data
        self.label_data = label_data
  
    def __getitem__(self, index):
        return [torch.from_numpy(x[index]).type(torch.FloatTensor) for x in self.text_data],\
               torch.from_numpy(self.label_data[index]).type(torch.FloatTensor)

    def __len__(self):
        return len(self.label_data)
       

def get_loader(dataset, batch_size, shuffle, num_workers):

    data_loader = torch.utils.data.DataLoader(dataset=dataset,
                                              batch_size=batch_size,
                                              shuffle=shuffle,
                                              num_workers=num_workers)
    return data_loader

def makeVocabulary(filename, size, sep=' ', char=False):

    vocab = Dict([PAD_WORD, UNK_WORD], lower=True)
    if char:
        vocab.addSpecial(SPA_WORD)

    lengths = []

    if type(filename) == list:
        for _filename in filename:
            data = p.load(open(_filename,"rb"))
            for sent in data:
                for word in sent.strip().split(sep):
                    lengths.append(len(word))
                    if char:
                        for ch in word.strip():
                            vocab.add(ch)
                    else:
                        vocab.add(word.strip())
    else:
        data = p.load(open(filename,"rb"))
        for sent in data:
            for word in sent.strip().split(sep):
                lengths.append(len(word))
                if char:
                    for ch in word.strip():
                        vocab.add(ch)
                else:
                    vocab.add(word.strip())
    print('max: %d, min: %d, avg: %.2f' % (max(lengths), min(lengths), sum(lengths)/len(lengths)))

    originalSize = vocab.size()
    vocab = vocab.prune(size)  
    print('Created dictionary of size %d (pruned from %d)' %
          (vocab.size(), originalSize))

    return vocab

def initVocabulary(name, dataFile, vocabFile, vocabSize, sep=' ', char=False):

    vocab = None
    if vocabFile is not None:
        # If given, load existing word dictionary.
        print('Reading ' + name + ' vocabulary from \'' + vocabFile + '\'...')
        vocab = Dict()
        vocab.loadFile(vocabFile)  
        print('Loaded ' + str(vocab.size()) + ' ' + name + ' words')

    if vocab is None:
        # If a dictionary is still missing, generate it.
        print('Building ' + name + ' vocabulary...')
        genWordVocab = makeVocabulary(dataFile, vocabSize, sep=sep, char=char)  
        vocab = genWordVocab

    return vocab


def saveVocabulary(name, vocab, file):

    print('Saving ' + name + ' vocabulary to \'' + file + '\'...')
    vocab.writeFile(file)

dicts = {}
dicts['text'] = initVocabulary('text', ['./data/abstract_train.p', './title_train.p'], None, 50000, ' ', False)
dicts['authors'] = initVocabulary('authors', './data/authors_train.p', None, 20000, ' ', False)

saveVocabulary('text', dicts['text'], './data/text_dict')
saveVocabulary('authors', dicts['authors'], './data/authors_dict')

pvs=p.load(open("./pv_train.p","rb"))
unique_pvs=np.unique(np.array(pvs))
dicts['pvs']=Dict()
for pv in unique_pvs:
    dicts['pvs'].add(pv)
dicts['pvs'] = dicts['pvs'].prune(300)
saveVocabulary('pvs', dicts['pvs'], './data/pvs_dict')

dicts['pvs'] = initVocabulary('pvs', None, './data/pvs_dict', 50000, ' ', False)

abstract = {'text_file': './abstract_train.p', 'text_dict': dicts['text'], 'doc_len': 8, 'text_len': 20}
title = {'text_file': './title_train.p', 'text_dict': dicts['text'], 'text_len': 20}
authors = {'text_file': './authors_train.p', 'text_dict': dicts['authors'], 'text_len': 7}
pv = {'pv_file': './pv_train.p', 'pv_dict': dicts['pvs']}

abstract_val = {'text_file': './abstract_val.p', 'text_dict': dicts['text'], 'doc_len': 8, 'text_len': 20}
title_val = {'text_file': './title_val.p', 'text_dict': dicts['text'], 'text_len': 20}
authors_val = {'text_file': './authors_val.p', 'text_dict': dicts['authors'], 'text_len': 7}
pv_val = {'pv_file': './pv_val.p', 'pv_dict': dicts['pvs']}

abstract_test = {'text_file': './abstract_test.p', 'text_dict': dicts['text'], 'doc_len': 8, 'text_len': 20}
title_test = {'text_file': './title_test.p', 'text_dict': dicts['text'], 'text_len': 20}
authors_test = {'text_file': './authors_test.p', 'text_dict': dicts['authors'], 'text_len': 7}
pv_test = {'pv_file': './pv_test.p', 'pv_dict': dicts['pvs']}

def make_data(abstract, title, authors, pv):

    text_data = []
    text_data.append(make_abstract_data(abstract['text_file'], abstract['text_dict'], abstract['doc_len'], abstract['text_len']))
    text_data.append(make_title_data(title['text_file'], title['text_dict'], title['text_len']))
    text_data.append(make_author_data(authors['text_file'], authors['text_dict'], authors['text_len'], sep=';'))
    pv_data = make_pv_data(pv['pv_file'], pv['pv_dict'])

    return dataset(text_data, pv_data)

# Each abstract becomes a (doc_length x text_length) matrix of word indices (unknown words map to index 1, i.e. UNK).
def make_abstract_data(text_file, text_dict, doc_length, text_length, sep=' '):
    result = []
    data=p.load(open(text_file,"rb"))
    for line in data:
        temp = np.zeros((doc_length, text_length))
        sents=nltk.sent_tokenize(line)
        for i in range(len(sents)):
            if i < doc_length:
                words = nltk.word_tokenize(sents[i].strip())
                for j in range(len(words)):
                    if j < text_length:
                        temp[i, j] = text_dict.lookup(words[j].lower(), 1)
        result.append(temp)
    return result

def make_title_data(text_file, text_dict, text_length, sep=' '):
    result = []
    data=p.load(open(text_file,"rb"))
    for line in data:
        temp = np.zeros(text_length)
        words = nltk.word_tokenize(line.strip())
        for i in range(len(words)):
            if i < text_length:
                temp[i] = text_dict.lookup(words[i].lower(), 1)
        result.append(temp)
    return result

def make_author_data(text_file, text_dict, text_length, sep=' '):
    result = []
    data=p.load(open(text_file,"rb"))
    for line in data:
        temp = np.zeros(text_length)
        words = line.strip().split(sep)
        for i in range(len(words)):
            if i < text_length:
                temp[i] = text_dict.lookup(words[i].lower(), 1)
        result.append(temp)
    return result

# Each venue label becomes a one-hot vector over the pvs dictionary.
def make_pv_data(pv_file, pv_dict):
    result = []
    length = len(pv_dict.idxToLabel)
    data=p.load(open(pv_file,"rb"))
    for line in data:
        temp = np.zeros(length)
        temp[pv_dict.lookup(str(line), 1)] = 1
        result.append(temp)
    return result

train = make_data(abstract, title, authors, pv)
val = make_data(abstract_val, title_val, authors_val, pv_val)
test = make_data(abstract_test, title_test, authors_test, pv_test)

data = {'train': train, 'val': val, 'test': test}
torch.save(data, './data/final_data_3')
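# The saved splits can later be reloaded with torch.load('./data/final_data_3').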

# added by me
print('DONE.')


@Sakura718

Sorry for the late reply. I have tried many times, but the modified script just keeps running and never finishes.

Hello, I have encountered the same problem as you. Although it has been a while, I would like to ask if you have solved this problem and how it was resolved. Thank you!

@Sakura718


Oh, I have already solved it. It was caused by the external network proxy. Thank you!
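
For anyone else hitting the same hang, a minimal sketch of pointing nltk at a proxy before the punkt download, assuming nltk's set_proxy helper and a placeholder proxy URL (replace it with your own):

import nltk

# Hypothetical proxy URL; replace with your actual proxy address.
nltk.set_proxy('http://your-proxy-host:8080')
nltk.download('punkt')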
