Identifying Ad Images

Code to differentiate ad from non-ad images based on the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text.

Summary of Strategy and Results

Strategy

The challenges to building a good model were:

Non-random missing data in the continuous variables.
A large number of features relative to the size of the sample (1,558 features vs 3,279 observations).
The overwhelming majority of the features are sparse.
There is only a moderate number of observations.
High class imbalance.

I tackled these challenges by:

Turning the continuous variables into binary variables, with missingness encoded as its own feature.
Using algorithms robust to unfavorable feature-to-observation ratios, like random forest, which considers only a random subset of the features at each split.
Using variance-threshold-based feature selection and regularized models to avoid introducing too many uninformative sparse features to the model.
Avoiding more data-hungry cross-validation strategies like nested cross-validation.
Adjusting class weights and using sampling techniques like SMOTE and Tomek Link removal.
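
The first step above (binary variables with a missing-value feature) can be sketched as follows. This is a minimal illustration, not the repo's actual code: the helper name `binarize_with_missing`, the quantile binning, the bin count, and the toy `height` column are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def binarize_with_missing(series: pd.Series, n_bins: int = 4) -> pd.DataFrame:
    """One-hot encode a continuous column into quantile bins plus a missing flag.

    Hypothetical helper: the binning scheme is an assumption, not the repo's.
    """
    # qcut leaves NaN where the value is missing; duplicates="drop" tolerates tied edges.
    bins = pd.qcut(series, q=n_bins, duplicates="drop")
    # dummy_na=True appends an explicit indicator column for the missing values.
    dummies = pd.get_dummies(bins, prefix=series.name, dummy_na=True)
    return dummies.astype(int)


# Toy data (assumed): image heights with some values missing.
height = pd.Series([60.0, np.nan, 125.0, 33.0, np.nan, 468.0], name="height")
encoded = binarize_with_missing(height, n_bins=2)
print(encoded)
```

Every row ends up with exactly one active column, so a model can learn a separate effect for "geometry unknown" instead of relying on an imputed value.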

Results

ROC curves on the test data:

The above strategies resulted in highly predictive models. The best iteration of each model explored had an AUC ROC of 0.95 or greater. The best model was the logistic classifier with a 1:1 class weight, the feature variance threshold set to drop only zero variance features, and no sampling-based class imbalance corrections.
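
As a rough sketch of that best configuration, the scikit-learn pipeline below drops only zero-variance features and fits a logistic classifier with its default L2 penalty and no class reweighting. The synthetic data, the step names, and `max_iter` are assumptions for illustration; the repo's actual training code may differ.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the sparse binary features (assumed, not the real data).
rng = np.random.default_rng(0)
X = (rng.random((600, 50)) < 0.05).astype(int)
X[:, 0] = 0  # a zero-variance column that the selector should drop
y = ((X[:, 1] + X[:, 2] + (rng.random(600) < 0.01)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipeline = Pipeline([
    # threshold=0.0 removes only features that are constant in the training data.
    ("variance", VarianceThreshold(threshold=0.0)),
    # The default penalty is L2; class_weight=None keeps the 1:1 class weight.
    ("logit", LogisticRegression(class_weight=None, max_iter=1000)),
])
pipeline.fit(X_train, y_train)

auc = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])
kept = pipeline.named_steps["variance"].get_support()
print(f"kept {kept.sum()} of {kept.size} features, test AUC ROC = {auc:.3f}")
```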

The performance of the best model was highly stable. The standard deviation of the validation fold AUC ROC was 0.012, a tiny fraction of the average AUC ROC.
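
One way to measure this kind of stability is to score each validation fold separately and compare the spread to the mean. The sketch below assumes 5-fold stratified cross-validation and synthetic data; the fold count and the data are illustrative, not taken from the repo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (assumed).
rng = np.random.default_rng(1)
X = (rng.random((600, 30)) < 0.05).astype(int)
y = ((X[:, 0] + X[:, 1] + (rng.random(600) < 0.01)) > 0).astype(int)

# Stratified folds keep the class ratio similar in every validation fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)

mean_auc, std_auc = scores.mean(), scores.std()
print(f"mean AUC ROC = {mean_auc:.3f}, std = {std_auc:.3f}")
```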

The most important features in the dataset seemed reasonable. Listed by order of importance (identified using random forest), the top five are:

1. ancurl*adclick
2. ancurl*adid
3. origurl*misfits2
4. url*static.wired.com
5. ancurl*http+www

Features 1, 2, and 5 seem to be ad attributes, while 3 and 4 seem to be the URL of the ad source.
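
A ranking like this can be produced from a fitted random forest's impurity-based importances. The sketch below uses synthetic data and placeholder feature names (`feat_0` through `feat_19`); the forest settings are assumptions rather than the values used in the repo.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary features with placeholder names (assumed).
rng = np.random.default_rng(2)
names = [f"feat_{i}" for i in range(20)]
X = pd.DataFrame((rng.random((800, 20)) < 0.1).astype(int), columns=names)
# Only feat_3 and feat_7 carry signal in this toy example.
y = ((X["feat_3"] + X["feat_7"]) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# feature_importances_ sums to 1; sort descending to list the top five.
importances = pd.Series(forest.feature_importances_, index=names)
top_five = importances.sort_values(ascending=False).head(5)
print(top_five)
```

Impurity-based importances can be biased when features differ in cardinality, which matters less here because the features are all binary.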

Many other model training approaches were left unexplored. For example, only L2 (ridge) regularization was used for the logistic classifier, and SVMs were not used (to save time on the first iteration of training). However, since the first iteration of training yielded models with an AUC ROC of 0.99, further refinement of the model training process seemed unnecessary.

Key Files

Installing

Uses Python 3.5 and Anaconda.

Linux

Change into the directory where you want to place the repo
Clone it: git clone https://github.com/jcharit1/Identifying-Ad-Images.git
Change into repo directory: cd Identifying-Ad-Images/
Edit the environment file prefix (at the end) to reflect your anaconda directory
Copy the environment
1. Option 1, partially copy the environment: conda env create -f environment_lite.yml
2. Option 2, copy the full environment: conda env create -f environment.yml
Load the environment: source activate Jimmy_Charite_py35
Make the script executable: chmod +x ./code/predict_image_type.py
Define the following full file paths via bash variables:
1. path_old_data (all of the data used to train and test the models)
2. path_colnames (use the file in the ./raw_data/ subdirectory of the repo)
3. path_new_data (data for new out of sample predictions)
4. path_pred_file (where the predictions will be saved)
Run the prediction script: ./code/predict_image_type.py $path_old_data $path_colnames $path_new_data $path_pred_file

Copying the full Python environment will take 10-15 minutes on a slow internet connection. However, OS specifics aside, it will give you a full mirror of my Python environment, so you should be able to run all the notebooks and scripts with, hopefully, no errors. The limited environment (environment_lite.yml) should be sufficient for running the prediction script.

Windows

TO DO

Uninstalling

Linux

Delete the repo: rm -rf Identifying-Ad-Images/
Remove the environment: conda env remove --name Jimmy_Charite_py35

Windows

TO DO

Contributing

Fork it!
Create your feature branch: git checkout -b my-new-feature
Commit your changes: git commit -am 'Add some feature'
Push to the branch: git push origin my-new-feature
Submit a pull request :D

Author

Jimmy Charité

License

This project is licensed under the MIT License - see the LICENSE.md file for details
