chore: add known issues to readme
* Unlimited translation, no API key required
* ### Datasets like ELI5 and OpenOrca, with over 100k examples, that would take more than 1,000 hours to translate on a single thread can be translated in under 2 hours (a speedup of roughly 500x from parallel requests)

* ## Setup
#### Requires Python 3.8 or above
##### Setup on local machine
```sh
git clone https://github.com/vTuanpham/Large_dataset_translator.git

cd Large_dataset_translator

# setup virtual env
virtualenv trans-env

# Activate virtual env
source trans-env/bin/activate

# Install package into virtual env
pip install -r requirements.txt
```

##### Setup on Colab
```sh
!git clone https://github.com/vTuanpham/Large_dataset_translator.git
%cd Large_dataset_translator
%pip install -r requirements.txt
```
* ## Test
#### This should take about 10-20 mins locally or 5-10 mins on Colab
##### Running test on local machine
```sh
python examples/YahmaAlpaca/AlpacaCleaned_Parser.py
```
##### Running test on Colab [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OEni8c9N9C_9Kf3ySt87goN7HDvRN3nw?usp=sharing)

```sh
%run examples/YahmaAlpaca/AlpacaCleaned_Parser.py
```
* Be sure to check the output in the examples/YahmaAlpaca dir; there should be an English version and a Vietnamese version
* If all goes well, you should have your own translated dataset!
* The yahma/alpaca-cleaned dataset on the Hugging Face Hub has 51.8k rows of data
* Interrupting the translation with Ctrl+C midway is not recommended
* ## Remember to leave a star 🌞 if the test was successful :)
## Translate your own dataset
* #### Look through all the examples in the examples/ dir first! A minimal sketch of the general pattern is shown below.
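
A minimal, hypothetical sketch of a custom parser script, modeled on the subclass-and-override pattern used by the scripts in the examples/ dir. The import path, class name, field names, and constructor arguments here are assumptions for illustration only; check a real example such as examples/YahmaAlpaca/AlpacaCleaned_Parser.py for the actual interface:

```python
# Hypothetical sketch -- names and signatures are assumptions, not the exact API.
import json

from translator.data_parser import DataParser  # assumed import path


class MyDatasetParser(DataParser):
    def __init__(self, file_path: str, output_path: str, **kwargs):
        # do_translate/target_lang are assumed constructor arguments
        super().__init__(file_path, output_path,
                         parser_name="MyDatasetParser",
                         do_translate=True,
                         target_lang="vi",
                         **kwargs)

    def read(self):
        # Load the raw dataset from disk
        super().read()
        with open(self.file_path, encoding="utf-8") as f:
            self.data_read = json.load(f)

    def convert(self):
        # Map each raw example onto the fields the translator expects
        # (field names here are illustrative)
        super().convert()
        self.converted_data = [
            {"qas_id": i,
             "question_text": example["instruction"],
             "orig_answer_texts": example["output"]}
            for i, example in enumerate(self.data_read)
        ]


if __name__ == "__main__":
    parser = MyDatasetParser("my_dataset.json", "examples/MyDataset")
    parser.read()
    parser.convert()
    parser.save()  # writes the original and translated versions (check the examples for the exact call)
```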
* Known issues:
  * `TypeError: 'NoneType' object is not iterable`
    This issue is related to gender-specific translation; you can read more here: https://github.com/ssut/py-googletrans/issues/260
  * There is a bug where, if a chunk contains only a single thread, only the examples in that thread are saved. For now, adjust the ***max_example_per_thread*** and ***large_chunks_threshold*** args so that the last chunk cannot contain a single thread (a hedged sketch follows below).
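
Following the hypothetical parser sketch above, the workaround might look like the call below. Whether these args are forwarded through the constructor is an assumption, and the values are illustrative; pick them relative to your dataset size:

```python
# Hedged sketch: choose values so that the final chunk always spans
# multiple threads (e.g. 20000 examples per chunk / 400 per thread = 50
# threads per chunk), rather than leaving a lone thread in the last chunk.
parser = MyDatasetParser(
    "my_dataset.json", "examples/MyDataset",
    max_example_per_thread=400,      # examples each thread translates
    large_chunks_threshold=20000,    # example count that triggers chunk splitting
)
```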
