# Job Vacancy Web Scraping Project

## Overview

A demo project for practicing Python and machine learning technologies.

The project extracts job postings from job sites and analyzes the data.

## Prerequisites

Python 3.6 or higher.

See `requirements.txt` in the project root for the required packages.

## Getting Started (Linux/Unix instructions)

1. **Clone this repo from GitHub.**
2. **Create a Python virtual environment in the folder that contains this repo:**

```bash
python -m venv venv
```

3. **Activate the virtual environment:**

```bash
source venv/bin/activate
```

4. **Install the project dependencies:**

```bash
pip install --no-cache-dir --upgrade -r requirements.txt
```

5. **Navigate to the application folder:**

```bash
cd app
```

6. **Set the PYTHONPATH environment variable:**

```bash
export PYTHONPATH="./"
```

7. **First-time database setup: create the SQLite database that stores the jobs:**

```bash
python scripts/create_database.py
```
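
The kind of table such a script creates might look like the sketch below. This is only an illustration: the database file name, table name, and columns are assumptions, and the real schema is whatever `scripts/create_database.py` defines.

```python
# Hypothetical sketch of a first-time SQLite setup; the actual schema
# lives in scripts/create_database.py and may differ.
import sqlite3

connection = sqlite3.connect("jobs.db")  # assumed database file name
connection.execute(
    """
    CREATE TABLE IF NOT EXISTS jobs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT NOT NULL,
        location TEXT,
        description TEXT,
        posted_date TEXT,
        url TEXT UNIQUE
    )
    """
)
connection.commit()
connection.close()
```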

8. **Create a directory to store logs:**

```bash
mkdir ../logs
```
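
As a purely illustrative sketch of how the application might write into this directory (the log file name and format here are assumptions, not the project's actual configuration):

```python
# Hypothetical logging setup targeting the ../logs directory created above.
import logging

logging.basicConfig(
    filename="../logs/scrape_jobs.log",  # assumed log file name
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logging.getLogger(__name__).info("Logging configured")
```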

9. **Run the program to scrape jobs (first see "Set job search parameters" below):**

```bash
python scrape_jobs.py
```

### Set job search parameters

Currently the only scraped site is [Jobserve](https://www.jobserve.com/). If you want to store a particular set of jobs in the database, you can fill in the Jobserve Job Search form and perform your search, then use the session id (`shid`) that appears in the browser query string to target that set. To do this, follow these steps:

1. Go to `https://www.jobserve.com/gb/en/Job-Search/` and set the values for your job search in the search form.
2. After hitting the Search button, you will be redirected to a search results page.
3. From its URL, `https://www.jobserve.com/gb/en/JobSearch.aspx?shid=<session-id>`, obtain the `session-id` value.
4. Use `config-base.json` to populate the value of `shid`:

```json
{
  "jobserve-shid": "<session-id>"
}
```

5. Rename `config-base.json` to `config.json`.

NOTE: After a few days without accessing Jobserve with this `session-id`, it will expire and you will need to repeat the search as explained in the steps above.
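
To illustrate how the scraper might consume this value, here is a minimal sketch; the function name and URL construction are assumptions for illustration, not the project's actual code:

```python
# Hypothetical illustration of reading the shid value from config.json
# and building the Jobserve search-results URL from it.
import json

def load_search_url(config_path: str = "config.json") -> str:
    """Build the Jobserve search-results URL from the stored session id."""
    with open(config_path, encoding="utf-8") as config_file:
        config = json.load(config_file)
    shid = config["jobserve-shid"]
    return f"https://www.jobserve.com/gb/en/JobSearch.aspx?shid={shid}"

print(load_search_url())
```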

**Additional Notes**:

- Ensure you have Python 3.6 or higher installed on your system.
- Run the unit tests from the `app` folder by executing `pytest`.
- Check the logs in the `../logs` directory if you encounter any issues running the application.

## Docker

Build the Docker image:

```bash
docker build -t scrapejobs .
```

Run the Docker container:

```bash
docker run -p 8000:80 scrapejobs
```

Inspect the application endpoints from the browser at:

```
http://127.0.0.1:8000/docs
```
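
The interactive `/docs` page suggests the container exposes an OpenAPI UI of the kind FastAPI serves automatically. As a minimal, purely illustrative sketch (the endpoint name and response are assumptions, not the project's real API):

```python
# Hypothetical minimal FastAPI app; shown only to illustrate why /docs
# serves interactive API documentation. The project's real endpoints differ.
from fastapi import FastAPI

app = FastAPI(title="scrapejobs")

@app.get("/jobs")  # assumed endpoint name
def list_jobs():
    """Return scraped jobs; a real implementation would query the database."""
    return [{"title": "Python Developer", "location": "London"}]
```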

## Running on WSL or in a Docker container

To run headless Chrome with Selenium on WSL or inside a Docker container, see [this Stack Overflow question](https://stackoverflow.com/questions/75012949/how-to-run-chrome-headless-in-docker-container-with-selenium).

Install Chrome:

```bash
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb
```

Find your Chrome major version:

```bash
CHROME_VERSION=$(google-chrome --version | cut -d ' ' -f 3 | cut -d '.' -f 1)
echo $CHROME_VERSION
```

Download the `chromedriver-linux64.zip` that matches this version from the [Chrome for Testing availability dashboard](https://googlechromelabs.github.io/chrome-for-testing/), then unzip it and install the driver:

```bash
unzip chromedriver-linux64.zip
cd chromedriver-linux64
sudo mv chromedriver /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver
```

For Chrome version 115 or newer, the Chrome for Testing dashboard above provides convenient JSON endpoints for downloading specific ChromeDriver versions; downloads for older versions are listed at [chromium.org/driver/downloads](https://sites.google.com/chromium.org/driver/downloads).
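
Once Chrome and chromedriver are installed, a short smoke test like the following can confirm the headless setup works. This is a sketch assuming Selenium 4; the project's real scraping logic lives in `scrape_jobs.py`:

```python
# Smoke test for headless Chrome + chromedriver in WSL/Docker.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")             # commonly needed in Docker
options.add_argument("--disable-dev-shm-usage")  # avoid small /dev/shm in containers

driver = webdriver.Chrome(
    service=Service("/usr/bin/chromedriver"),  # path used in the steps above
    options=options,
)
driver.get("https://www.jobserve.com/gb/en/Job-Search/")
print(driver.title)  # prints the page title if everything is wired up
driver.quit()
```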

## Project Structure

- [Outline the project's file structure and explain the purpose of each file/directory]

## Data Source

[Specify the job posting site you chose for web scraping and provide a brief justification for your choice. If you encountered any anti-scraping filters, mention how you handled them.]

Job advertisements are scraped from the [JobServe](https://www.jobserve.com/) site.

## Data Processing

[Explain the steps involved in processing the scraped data, such as cleaning, transforming, or filtering the data.]

## Analysis and Visualization

[Describe the techniques and tools you used for analyzing and visualizing the data. You can include sample visualizations or screenshots in this section.]

## Findings and Recommendations

[Summarize your key findings and provide recommendations based on your analysis. This section should highlight the insights gained from the project and how they can help the recruitment agency achieve its objectives.]

## Limitations and Future Improvements

[Discuss any limitations or challenges you faced during the project and suggest potential improvements or future enhancements.]

## License

This project is licensed under the [MIT License](https://opensource.org/licenses/MIT).