- Enhanced Log Management. Centralized log storage in MongoDB, reduced the dependency on PubSub, and enabled log error detection.
- API Token. Allow users to generate API tokens and use them to integrate Crawlab into their own systems.
- Web Hook. Trigger a Web Hook HTTP request to a pre-defined URL when a task starts or finishes (see the sketch after this list).
- Auto Install Dependencies. Allow installing dependencies automatically from `requirements.txt` or `package.json`.
- Auto Results Collection. Set the results collection to `results_<spider_name>` if it is not set.
- Optimized Project List. Do not display the "No Project" item in the project list.
- Upgrade Node.js. Upgrade Node.js version from v8.12 to v10.19.
- Add Run Button in Schedule Page. Allow users to manually run tasks on the Schedule page.
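A minimal sketch of a user-side receiver for the Web Hook feature above, assuming Crawlab sends an HTTP POST with a JSON body to the pre-defined URL when a task starts or finishes; the route, port, and payload fields here are illustrative assumptions, not Crawlab's documented contract:

```python
# Hypothetical receiver for Crawlab's Web Hook callbacks (payload shape is an assumption).
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/crawlab-webhook", methods=["POST"])
def crawlab_webhook():
    event = request.get_json(silent=True) or {}
    # Inspect whatever Crawlab sends when a task starts or finishes.
    print("Received task event:", event)
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=8000)
```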
- Cannot register. #670
- Spider schedule tab cron expression shows seconds. #678
- Missing daily stats in spider. #684
- Results count not updated in time. #689
- Challenges. Users can achieve different challenges based on their actions.
- More Advanced Access Control. More granular access control, e.g. normal users can only view/manage their own spiders/projects and admin users can view/manage all spiders/projects.
- Feedback. Allow users to send feedback and ratings to the Crawlab team.
- Better Home Page Metrics. Optimized metrics display on home page.
- Configurable Spiders Converted to Customized Spiders. Allow users to convert their configurable spiders into customized spiders which are also Scrapy spiders.
- View Tasks Triggered by Schedule. Allow users to view tasks triggered by a schedule. #648
- Support Results De-Duplication. Allow users to configure de-duplication of results. #579
- Support Task Restart. Allow users to re-run historical tasks.
- CLI unusable on Windows. #580
- Re-upload error. #643 #640
- Upload missing folders. #646
- Unable to add schedules in Spider Page.
- Support Installations of More Programming Languages. Now users can install or pre-install more programming languages, including Java, .NET Core and PHP.
- Installation UI Optimization. Users can better view and manage installations on the Node List page.
- More Git Support. Allow users to view the Git commit history and check out a corresponding commit.
- Support Hostname Node Registration Type. Users can set the hostname as the node key, i.e. its unique identifier.
- RPC Support. Added RPC support to better manage node communication.
- Run On Master Switch. Users can decide whether to run tasks on the master node; if disabled, all tasks run only on worker nodes.
- Disabled Tutorial by Default.
- Added Related Documentation Sidebar.
- Loading Page Optimization.
- Duplicated Nodes. #391
- Duplicated Spider Upload. #603
- Failure in dependency installation makes the dependency installation functionality unusable. #609
- Create Tasks for Offline Nodes. #622
- Better Support for Scrapy. Spider identification, `settings.py` configuration, log level selection, spider selection. #435
- Git Sync. Allow users to sync Git projects to Crawlab.
- Long Task Support. Users can add long-task spiders which are supposed to run without finishing. #425
- Spider List Optimization. Task counts by status, task detail popup, legend. #425
- Upgrade Check. Check the latest version and notify users to upgrade.
- Spiders Batch Operation. Allow users to run/stop spider tasks and delete spiders in batches.
- Copy Spiders. Allow users to copy an existing spider to create a new one.
- Wechat Group QR Code.
- Schedule Spider Selection Issue. Fields not responding to spider change.
- Cron Jobs Conflict. Possible bug when two spiders' cron jobs are set to the same time. #515 #565
- Task Log Issue. Different tasks write to the same log file if triggered at the same time. #577
- Task List Filter Options Incomplete.
- SDK for Node.js. Users can use the SDK in their Node.js spiders.
- Log Management Optimization. Log search, error highlight, auto-scrolling.
- Task Execution Process Optimization. Allow users to be redirected to task detail page after triggering a task.
- Task Display Optimization. Added "Param" in the Latest Tasks table in the spider detail page. #295
- Spider List Optimization. Added "Update Time" and "Create Time" in spider list page.
- Page Loading Placeholder.
- Interactive Tutorial. Guide users through the main functionalities of Crawlab.
- Global Environment Variables. Allow users to set global environment variables, which will be passed into all spider programs (see the sketch after this list). #177
- Project. Allow users to link spiders to projects. #316
- Demo Spiders. Added demo spiders when Crawlab is initialized. #379
- User Admin Optimization. Restrict privileges of admin users. #456
- Setting Page Optimization.
- Task Results Optimization.
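A minimal sketch of how a spider program could consume the global environment variables described above; since spiders run as ordinary processes, the values arrive through the process environment. The variable name `CRAWL_DELAY` is a hypothetical example, not a Crawlab-defined key:

```python
# Read a globally configured environment variable inside a spider (variable name is hypothetical).
import os

crawl_delay = int(os.environ.get("CRAWL_DELAY", "1"))
print(f"Using crawl delay of {crawl_delay} second(s)")
```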
- Unable to find spider file error. #485
- Click delete button results in redirect. #480
- Unable to create files in an empty spider. #479
- Download results error. #465
- crawlab-sdk CLI error. #458
- Page refresh issue. #441
- Results do not support JSON. #202
- Getting all spiders after deleting a spider.
- i18n warning.
- Email Notification. Allow users to send email notifications.
- DingTalk Robot Notification. Allow users to send DingTalk Robot notifications.
- Wechat Robot Notification. Allow users to send Wechat Robot notifications.
- API Address Optimization. Added a relative URL path in the frontend so that users don't have to specify `CRAWLAB_API_ADDRESS` explicitly.
- SDK Compatibility. Allow users to integrate Scrapy or general spiders with the Crawlab SDK (see the sketch after this list).
- Enhanced File Management. Added a tree-like file sidebar to allow users to edit files much more easily.
- Advanced Schedule Cron. Allow users to edit schedule cron with visualized cron editor.
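A minimal sketch of the SDK integration mentioned above, assuming the Python `crawlab-sdk` package exposes a `save_item` helper for sending scraped results back to Crawlab; treat the import path and function name as assumptions and check the SDK documentation for the exact API:

```python
# Sketch of a general (non-Scrapy) spider reporting results via the Crawlab SDK.
from crawlab import save_item  # assumed import from the crawlab-sdk package

def handle_result(item: dict) -> None:
    # Forward each scraped item to Crawlab so it appears on the task's results page.
    save_item(item)

handle_result({"title": "Example", "url": "https://example.com"})
```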
- `nil returned` error.
- Error when using HTTPS.
- Unable to run Configurable Spiders on Spider List.
- Missing form validation before uploading spider files.
- Dependency Installation. Allow users to install/uninstall dependencies and add programming languages (Node.js only for now) on the platform web interface.
- Pre-install Programming Languages in Docker. Allow Docker users to set `CRAWLAB_SERVER_LANG_NODE` to `Y` to pre-install the `Node.js` environment.
- Add Schedule List in Spider Detail Page. Allow users to view / add / edit schedule cron jobs in the spider detail page. #360
- Align Cron Expression with Linux. Changed the cron expression from 6 elements to 5 elements, as in Linux crontab (the leading seconds field is dropped).
- Enable/Disable Schedule Cron. Allow users to enable/disable the schedule jobs. #297
- Better Task Management. Allow users to batch delete tasks. #341
- Better Spider Management. Allow users to sort and filter spiders in the spider list page.
- Added Chinese `CHANGELOG`.
- Added GitHub Star Button in the nav bar.
- Schedule Cron Task Issue. #423
- Upload Spider Zip File Issue. #403 #407
- Exit due to Network Failure. #340
- Cron Jobs not Running Correctly.
- Schedule List Columns Mis-positioned.
- Clicking Refresh Button Redirected to 404 Page.
- Disclaimer. Added page for Disclaimer.
- Call API to fetch version. #371
- Configure to allow user registration. #346
- Allow adding new users.
- More Advanced File Management. Allow users to add / edit / rename / delete files. #286
- Optimized Spider Creation Process. Allow users to create an empty customized spider before uploading the zip file.
- Better Task Management. Allow users to filter tasks by selecting certain criteria. #341
- Spiderfile Optimization. Stages changed from dictionary to array. #358
- Baidu Tongji Update.
- Configurable Spider. Allow users to add spiders using Spiderfile to configure crawling rules.
- Execution Mode. Allow users to select 3 modes for task execution: All Nodes, Selected Nodes and Random.
- Task accidentally killed. #306
- Documentation fix. #301
- Direct deploy incompatible with Windows. #288
- Log files lost. #269
- Graceful Shutdown. detail
- Node Info Optimization. detail
- Append System Environment Variables to Tasks. detail
- Auto Refresh Task Log. detail
- Enable HTTPS Deployment. detail
- Unable to fetch spider list info in schedule jobs. detail
- Unable to fetch node info from worker nodes. detail
- Unable to select node when trying to run spider tasks. detail
- Unable to fetch result count when result volume is large. #260
- Node issue in schedule tasks. #244
- Docker Image Optimization. Split the Docker image further into master, worker and frontend images based on Alpine.
- Unit Tests. Covered part of the backend code with unit tests.
- Frontend Optimization. Optimized the login page, button sizes, and upload UI hints.
- More Flexible Node Registration. Allow users to pass a variable as the key for node registration instead of the default MAC address.
- Uploading Large Spider Files Error. Memory crash issue when uploading large spider files. #150
- Unable to Sync Spiders. Fixed by increasing the write permission level when synchronizing spider files. #114
- Spider Page Issue. Fixed by removing the "Site" field. #112
- Node Display Issue. Nodes do not display correctly when running docker containers on multiple machines. #99
- Golang Backend: Refactored the backend from Python to Golang, with much better stability and performance.
- Node Network Graph: Visualization of node topology.
- Node System Info: View system info including OS, CPUs and executables.
- Node Monitoring Enhancement: Nodes are monitored and registered through Redis.
- File Management: Edit spider files online, with code highlighting.
- Login/Register/User Management: Require users to log in to use Crawlab; allow user registration and user management, with some role-based authorization.
- Automatic Spider Deployment: Spiders are deployed/synchronized to all online nodes automatically.
- Smaller Docker Image: Slimmed Docker image and reduced Docker image size from 1.3G to ~700M by applying Multi-Stage Build.
- Node Status. Node status did not change even when a node actually went offline. #87
- Spider Deployment Error. Fixed through Automatic Spider Deployment. #83
- Node not showing. Nodes were not able to show as online. #81
- Cron Job not working. Fixed through the new Golang backend. #64
- Flower Error. Fixed through the new Golang backend. #57
- Documentation: Better and much more detailed documentation.
- Better Crontab: Build crontab expressions through a crontab UI.
- Better Performance: Switched from the native Flask engine to `gunicorn`. #78
- Deleting Spider. Deleting a spider not only removes its record in the db but also removes the related folder, tasks and schedules. #69
- MongoDB Auth. Allow users to specify `authenticationDatabase` to connect to `mongodb` (see the sketch below). #68
- Windows Compatibility. Added `eventlet` to `requirements.txt`. #59
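A minimal client-side sketch of what the MongoDB auth option above corresponds to, using `pymongo` for illustration; in pymongo the authentication database is passed as `authSource`, and the host, credentials, and database name below are placeholders:

```python
# Illustrative only: authenticate against a dedicated authentication database.
from pymongo import MongoClient

client = MongoClient(
    host="localhost",
    port=27017,
    username="crawlab_user",  # placeholder credentials
    password="secret",
    authSource="admin",       # corresponds to the authenticationDatabase option
)
db = client["crawlab_test"]
print(db.list_collection_names())
```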
- Docker: Users can run the Docker image to speed up deployment.
- CLI: Allow users to use a command-line interface to execute Crawlab programs.
- Upload Spider: Allow users to upload a Customized Spider to Crawlab.
- Edit Fields on Preview: Allow users to edit fields when previewing data in a Configurable Spider.
- Spiders Pagination. Fixed the pagination problem on the spider page.
- Automatic Extract Fields: Automatically extract data fields from list pages for the Configurable Spider.
- Download Results: Allow downloading results as a CSV file.
- Baidu Tongji: Allow users to choose to report usage info to Baidu Tongji.
- Results Page Pagination: Fixed the pagination of the results page so it works correctly. #45
- Schedule Tasks Duplicated Triggers: Set Flask `DEBUG` to `False` so that scheduled tasks won't trigger twice. #32
- Frontend Environment: Added `VUE_APP_BASE_URL` as a production-mode environment variable so that API calls won't always go to `localhost` in deployed environments. #30
- Configurable Spider: Allow users to create a spider to crawl data without coding.
- Advanced Stats: Advanced analytics in spider detail view.
- Sites Data: Added sites list (China) for users to check info such as robots.txt and home page response time/code.
- Basic Stats: User can view basic stats such as number of failed tasks and number of results in spiders and tasks pages.
- Near Realtime Task Info: Periodically (every 5 seconds) poll data from the server to view task info in a near-realtime fashion.
- Scheduled Tasks: Allow users to set up cron-like scheduled/periodical tasks using apscheduler (see the sketch below).
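A minimal, generic `apscheduler` sketch showing the kind of cron-like scheduling referred to above; this illustrates the library, not Crawlab's internal scheduler code, and `run_spider` is a hypothetical stand-in for the real task trigger:

```python
# Generic apscheduler usage (illustrative, not Crawlab's implementation).
from apscheduler.schedulers.blocking import BlockingScheduler

def run_spider(spider_name: str) -> None:
    print(f"Triggering spider: {spider_name}")  # placeholder for the actual task trigger

scheduler = BlockingScheduler()
# Cron-like trigger: run every day at 03:00.
scheduler.add_job(run_spider, "cron", hour=3, minute=0, args=["example_spider"])
scheduler.start()
```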
- Initial Release