Athleta is a subsidiary of Gap Inc. that designs, manufactures, and sells women's and children's activewear. This parser collects every product listed in the athleta.gap.com online store.
Approximate number of products: 20000
Approximate number of requests: 20000
Recommended subscription plan: X-Small
ATTENTION! The number of requests may exceed the number of products, because data about variations, images, and so on may be parsed using requests to additional resources. Part of the product data may also be delivered via XHR requests, which further increases the total number of requests required.
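As a rough illustration of why the request count can exceed the product count, the sketch below adds up the kinds of requests the script further down appears to make: the home page, each top-navigation page, each listing page returned by the search endpoint, and each product page. The function name, the page size, and the sample numbers are hypothetical and chosen only to show the arithmetic.

```python
import math


def estimate_requests(products_per_category, page_size, nav_links):
    """Rough request count for one crawl (all inputs are assumed values)."""
    # Every category costs at least one search request, even if it is empty.
    listing_pages = sum(
        max(1, math.ceil(count / page_size)) for count in products_per_category
    )
    # One request per product page; a product listed under two categories
    # is counted twice here because it is reached through both of them.
    product_pages = sum(products_per_category)
    # The leading 1 is the home page that seeds the top-navigation links.
    return 1 + nav_links + listing_pages + product_pages


# Hypothetical crawl: three categories, 60 products per listing page, 5 nav links.
print(estimate_requests([120, 45, 0], page_size=60, nav_links=5))  # 175
```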
To use it, you need an account with our Diggernaut service.
- Follow this link to register with the Diggernaut service
- After registering and confirming your email address, log in to your account
- Create a project with any name and description; if you don't know how, refer to our documentation
- Open the newly created project and create a digger in it with any name; if you don't know how, refer to our documentation
- Copy the digger script below to the clipboard and paste it into the digger you created; if you don't know how, refer to our documentation
- Switch the digger's mode from Debug to Active; if you don't know how, refer to our documentation
- Run your digger and wait for it to finish; if you don't know how, refer to our documentation
- Download the collected dataset in the format you need; if you don't know how, refer to our documentation
Later you can set up a schedule to run your parser and collect the data regularly.
Parser script:
---
config:
debug: 2
agent: Firefox
do:
- walk:
to: http://athleta.gap.com/
do:
- find:
path: div.topnav_atol>ul>li>a
do:
- parse:
attr: href
- space_dedupe
- trim
- if:
match: \w+
do:
- link_add:
pool: main
- walk:
to: links
pool: main
do:
- find:
path: .sidebar-navigation
do:
- node_remove: h1
- sequence:
header: h2
selector: h2,div
- find:
path: div.sequence
do:
- variable_clear: catname
- find:
path: h2
do:
- parse
- space_dedupe
- trim
- variable_set: catname
- find:
path: .sidebar-navigation--category--link
do:
- pool_clear: pager
- parse:
attr: href
filter:
- cid=(.+)
- variable_set: cid
- register_set: http://athleta.gap.com/resources/productSearch/v1/search?cid=<%cid%>&locale=en_US&isFacetsEnabled=true
- link_add:
pool: pager
- walk:
to: links
pool: pager
do:
- variable_clear: ptot
- find:
path: pageNumberTotal
do:
- parse
- if:
match: (^\s*[0-1]\s*$)
else:
- variable_set: ptot
- find:
path: pageNumberRequested
do:
- parse
- if:
match: (^\s*0\s*$)
do:
- variable_get: ptot
- if:
match: (\d)
do:
- if:
gt: 1
do:
- eval:
routine: js
body: '(function (){var r = ""; for (var i = 1; i<<%ptot%>; i++){r += ""+i+""}; return r;})();'
- to_block
- find:
path: div
do:
- parse
- variable_set: pageid
- register_set: http://athleta.gap.com/resources/productSearch/v1/search?cid=<%cid%>&locale=en_US&pageId=<%pageid%>&isFacetsEnabled=true
- link_add:
pool: pager
- find:
path: productCategory > name
do:
- parse
- space_dedupe
- trim
- variable_set: catname2
- find:
path: productCategory > childProducts
do:
- find:
path: parentBusinessCatalogItemId
do:
- parse
- if:
match: (\S)
do:
- variable_set: pid
- register_set: http://athleta.gap.com/browse/product.do?pid=<%pid%>&cid=<%cid%>
- walk:
to: value
do:
- variable_clear: isP
- find:
path: script:matches(gap.pageProductData\s*=\s*\{)
do:
- variable_set:
field: isP
value: 1
- find:
path: html
do:
- variable_get: isP
- if:
match: (1)
do:
- object_new: product
- find:
path: head
do:
- eval:
routine: js
body: '(function (){var d = new Date(); return d.toISOString()})();'
- object_field_set:
object: product
field: date
- static_get: url
- object_field_set:
object: product
field: url
- register_set: 'GAP'
- object_field_set:
object: product
field: brand
- find:
path: meta[name="keywords"]
do:
- parse:
attr: content
- object_field_set:
object: product
field: description
- find:
path: script:matches(gap.pageProductData\s*=\s*\{)
do:
- parse:
filter:
- gap\.currentBrand\s*=\s*\"(.+)\"\;
- if:
match: (\S)
do:
- object_field_set:
object: product
field: brand
- parse
- normalize:
routine: replace_substring
args:
var\s*gap\s*=\s*window\.gap\s*\|\|\s*\{\s*\}\;: ''
gap\.pageProductData\s*=\s*: ''
\s*;\s*gap.currentBrand\s*=\s*.*\;: ''
- normalize:
routine: json2xml
- to_block
- find:
path: productimages
do:
- parse:
format: html
- variable_set: imghtml
- find:
path: variants > productstylecolors > productstylecolorimages
do:
- parse
- normalize:
routine: lower
- variable_set: imgpath
- register_set: <%imghtml%>
- to_block
- find:
path: safe_<%imgpath%>
do:
- variable_clear: getit
- find:
path: xlarge
do:
- parse
- if:
match: (\S)
do:
- variable_set:
field: getit
value: 1
- normalize:
routine: url
- object_field_set:
object: product
field: images
joinby: "|"
- variable_get: getit
- if:
match: (1)
else:
- find:
path: large
do:
- parse
- if:
match: (\S)
do:
- variable_set:
field: getit
value: 1
- normalize:
routine: url
- object_field_set:
object: product
field: images
joinby: "|"
- variable_get: getit
- if:
match: (1)
else:
- find:
path: medium
do:
- parse
- if:
match: (\S)
do:
- variable_set:
field: getit
value: 1
- normalize:
routine: url
- object_field_set:
object: product
field: images
joinby: "|"
- variable_get: getit
- if:
match: (1)
else:
- find:
path: small
do:
- parse
- if:
match: (\S)
do:
- variable_set:
field: getit
value: 1
- normalize:
routine: url
- object_field_set:
object: product
field: images
joinby: "|"
- find:
path: body_safe > variants > productstylecolors > colorname
do:
- parse
- if:
match: (\S)
do:
- object_field_set:
object: product
field: variations
joinby: "|"
- find:
path: body_safe > name
do:
- parse
- if:
match: (\S)
do:
- object_field_set:
object: product
field: name
- find:
path: body_safe > currentmaxprice, body_safe > currentminprice
do:
- parse:
filter:
- (\d+\.?\d*)
- if:
match: (\d+)
do:
- object_field_set:
object: product
field: price
type: float
- register_set: USD
- object_field_set:
object: product
field: currency
- find:
path: styleid
slice: 0
do:
- parse
- object_field_set:
object: product
field: sku
- find:
path: body
do:
- find:
path: '.selected'
do:
- parse
- space_dedupe
- trim
- object_field_set:
object: product
field: category
joinby: "|"
- variable_get: catname
- if:
match: (\S)
do:
- object_field_set:
object: product
field: category
joinby: "|"
- variable_get: catname2
- if:
match: (\S)
do:
- object_field_set:
object: product
field: category
joinby: "|"
- object_save:
name: product
- find:
path: productCategory > childCategories
do:
- variable_clear: catname3
- find:
path: name
slice: 0
do:
- parse
- space_dedupe
- trim
- variable_set: catname3
- find:
path: parentBusinessCatalogItemId
do:
- parse
- if:
match: (\S)
do:
- variable_set: pid
- register_set: http://athleta.gap.com/browse/product.do?pid=<%pid%>&cid=<%cid%>
- walk:
to: value
do:
- variable_clear: isP
- find:
path: script:matches(gap.pageProductData\s*=\s*\{)
do:
- variable_set:
field: isP
value: 1
- find:
path: html
do:
- variable_get: isP
- if:
match: (1)
do:
- object_new: product
- find:
path: head
do:
- eval:
routine: js
body: '(function (){var d = new Date(); return d.toISOString()})();'
- object_field_set:
object: product
field: date
- static_get: url
- object_field_set:
object: product
field: url
- register_set: 'GAP'
- object_field_set:
object: product
field: brand
- find:
path: meta[name="keywords"]
do:
- parse:
attr: content
- object_field_set:
object: product
field: description
- find:
path: script:matches(gap.pageProductData\s*=\s*\{)
do:
- parse:
filter:
- gap\.currentBrand\s*=\s*\"(.+)\"\;
- if:
match: (\S)
do:
- object_field_set:
object: product
field: brand
- parse
- normalize:
routine: replace_substring
args:
var\s*gap\s*=\s*window\.gap\s*\|\|\s*\{\s*\}\;: ''
gap\.pageProductData\s*=\s*: ''
\s*;\s*gap.currentBrand\s*=\s*.*\;: ''
- normalize:
routine: json2xml
- to_block
- find:
path: productimages
do:
- parse:
format: html
- variable_set: imghtml
- find:
path: variants > productstylecolors > productstylecolorimages
do:
- parse
- normalize:
routine: lower
- variable_set: imgpath
- register_set: <%imghtml%>
- to_block
- find:
path: safe_<%imgpath%>
do:
- variable_clear: getit
- find:
path: xlarge
do:
- parse
- if:
match: (\S)
do:
- variable_set:
field: getit
value: 1
- normalize:
routine: url
- object_field_set:
object: product
field: images
joinby: "|"
- variable_get: getit
- if:
match: (1)
else:
- find:
path: large
do:
- parse
- if:
match: (\S)
do:
- variable_set:
field: getit
value: 1
- normalize:
routine: url
- object_field_set:
object: product
field: images
joinby: "|"
- variable_get: getit
- if:
match: (1)
else:
- find:
path: medium
do:
- parse
- if:
match: (\S)
do:
- variable_set:
field: getit
value: 1
- normalize:
routine: url
- object_field_set:
object: product
field: images
joinby: "|"
- variable_get: getit
- if:
match: (1)
else:
- find:
path: small
do:
- parse
- if:
match: (\S)
do:
- variable_set:
field: getit
value: 1
- normalize:
routine: url
- object_field_set:
object: product
field: images
joinby: "|"
- find:
path: body_safe > variants > productstylecolors > colorname
do:
- parse
- if:
match: (\S)
do:
- object_field_set:
object: product
field: variations
joinby: "|"
- find:
path: body_safe > name
do:
- parse
- if:
match: (\S)
do:
- object_field_set:
object: product
field: name
- find:
path: body_safe > currentmaxprice, body_safe > currentminprice
do:
- parse:
filter:
- (\d+\.?\d*)
- if:
match: (\d+)
do:
- object_field_set:
object: product
field: price
type: float
- register_set: USD
- object_field_set:
object: product
field: currency
- find:
path: styleid
slice: 0
do:
- parse
- object_field_set:
object: product
field: sku
- find:
path: body
do:
- find:
path: '.selected'
do:
- parse
- space_dedupe
- trim
- object_field_set:
object: product
field: category
joinby: "|"
- variable_get: catname
- if:
match: (\S)
do:
- object_field_set:
object: product
field: category
joinby: "|"
- variable_get: catname2
- if:
match: (\S)
do:
- object_field_set:
object: product
field: category
joinby: "|"
- variable_get: catname3
- if:
match: (\S)
do:
- object_field_set:
object: product
field: category
joinby: "|"
- object_save:
name: product
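The two less obvious pieces of the script are the pager and the image-size fallback. The Python sketch below restates both outside Diggernaut, purely as an illustration. The function names are hypothetical, and it assumes, as the script appears to, that the first search response is page 0 and that `pageNumberTotal` holds the total number of pages.

```python
SEARCH_URL = "http://athleta.gap.com/resources/productSearch/v1/search"


def pager_urls(cid, page_total):
    """List the search URLs that would be queued for one category."""
    base = f"{SEARCH_URL}?cid={cid}&locale=en_US"
    # The first request carries no pageId and returns page 0.
    urls = [f"{base}&isFacetsEnabled=true"]
    # Remaining pages are 1 .. page_total - 1, mirroring the loop in the eval step.
    for page_id in range(1, page_total):
        urls.append(f"{base}&pageId={page_id}&isFacetsEnabled=true")
    return urls


def best_image(sizes):
    """Return the largest available image URL, or None if no size is set."""
    for size in ("xlarge", "large", "medium", "small"):
        url = (sizes.get(size) or "").strip()
        if url:
            return url
    return None


# Usage with made-up values:
for url in pager_urls("1006482", 3):
    print(url)
print(best_image({"large": "large.jpg", "small": "small.jpg"}))
```

A category with a single page yields only the first URL, which matches the script skipping the pager when `pageNumberTotal` is 0 or 1.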
Below is a sample dataset with several products, shown in JSON for readability. The dataset can also be downloaded as CSV, XLSX, XML, or any other text format by using templates.
[{
    "product": {
        "brand": "athleta",
        "category": "New Arrivals|CATEGORIES|All New Arrivals",
        "currency": "USD",
        "date": "2017-12-06T19:35:53.451Z",
        "description": "Easy Cozy Karma Jacket, New Arrivals, New Arrivals All New Arrivals, Athleta",
        "images": "http://athleta.gap.com/webcontent/0014/295/432/cn14295432.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/088/415/cn14088415.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/130/170/cn14130170.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/068/604/cn14068604.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/295/432/cn14295432.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/088/415/cn14088415.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/130/170/cn14130170.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/068/604/cn14068604.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/295/432/cn14295432.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/088/415/cn14088415.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/130/170/cn14130170.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg|http://athleta.gap.com/webcontent/0014/068/604/cn14068604.jpg|http://athleta.gap.com/webcontent/0014/295/469/cn14295469.jpg|http://athleta.gap.com/webcontent/0014/295/464/cn14295464.jpg|http://athleta.gap.com/webcontent/0014/295/460/cn14295460.jpg|http://athleta.gap.com/webcontent/0014/509/387/cn14509387.jpg",
        "name": "Easy Cozy Karma Jacket",
        "price": 118,
        "sku": "158372",
        "url": "http://athleta.gap.com/browse/product.do?pid=158372&cid=1006482",
        "variations": "White Heather|Charcoal Heather|Cassis Heather|Black|White Heather|Charcoal Heather|Cassis Heather|Black|White Heather|Charcoal Heather|Cassis Heather|Black"
    }
}
,{
    "product": {
        "brand": "athleta",
        "category": "New Arrivals|CATEGORIES|All New Arrivals",
        "currency": "USD",
        "date": "2017-12-06T19:35:56.279Z",
        "description": "Velour Hoodie, New Arrivals, New Arrivals All New Arrivals, Athleta",
        "images": "http://athleta.gap.com/webcontent/0014/120/934/cn14120934.jpg|http://athleta.gap.com/webcontent/0014/121/309/cn14121309.jpg|http://athleta.gap.com/webcontent/0014/449/374/cn14449374.jpg",
        "name": "Velour Hoodie",
        "price": 118,
        "sku": "158403",
        "url": "http://athleta.gap.com/browse/product.do?pid=158403&cid=1006482",
        "variations": "Charcoal Grey Heather"
    }
}
,{
    "product": {
        "brand": "athleta",
        "category": "New Arrivals|CATEGORIES|All New Arrivals",
        "currency": "USD",
        "date": "2017-12-06T19:35:57.948Z",
        "description": "Luxe Stronger Hoodie, New Arrivals, New Arrivals All New Arrivals, Athleta",
        "images": "http://athleta.gap.com/webcontent/0012/348/901/cn12348901.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0014/422/782/cn14422782.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0012/302/088/cn12302088.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0012/348/901/cn12348901.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0014/422/782/cn14422782.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0012/348/901/cn12348901.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0014/422/782/cn14422782.jpg|http://athleta.gap.com/webcontent/0014/422/795/cn14422795.jpg|http://athleta.gap.com/webcontent/0012/302/897/cn12302897.jpg|http://athleta.gap.com/webcontent/0014/522/557/cn14522557.jpg|http://athleta.gap.com/webcontent/0012/204/913/cn12204913.jpg",
        "name": "Luxe Stronger Hoodie",
        "price": 148,
        "sku": "456789",
        "url": "http://athleta.gap.com/browse/product.do?pid=456789&cid=1006482",
        "variations": "Oatmeal Heather|Black Multi|Black|Oatmeal Heather|Black Multi|Oatmeal Heather|Black Multi"
    }
}
,{
    "product": {
        "brand": "athleta",
        "category": "New Arrivals|CATEGORIES|All New Arrivals",
        "currency": "USD",
        "date": "2017-12-06T19:36:03.291Z",
        "description": "Stronger Long Hoodie, New Arrivals, New Arrivals All New Arrivals, Athleta",
        "images": "http://athleta.gap.com/webcontent/0014/365/879/cn14365879.jpg|http://athleta.gap.com/webcontent/0014/365/874/cn14365874.jpg|http://athleta.gap.com/webcontent/0014/330/558/cn14330558.jpg|http://athleta.gap.com/webcontent/0014/365/856/cn14365856.jpg|http://athleta.gap.com/webcontent/0014/365/874/cn14365874.jpg|http://athleta.gap.com/webcontent/0014/330/558/cn14330558.jpg",
        "name": "Stronger Long Hoodie",
        "price": 138,
        "sku": "158356",
        "url": "http://athleta.gap.com/browse/product.do?pid=158356&cid=1006482",
        "variations": "Light Grey Multi|Black Multi"
    }
}]
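In the sample above, `images` and `variations` are pipe-joined strings that contain repeated entries. A minimal post-processing sketch follows, assuming the dataset was downloaded as JSON in the shape shown. The function names are hypothetical and not part of Diggernaut, and the inline record is a shortened stand-in for one product from the sample.

```python
import json


def split_unique(joined, sep="|"):
    """Split a joined field and drop repeats, keeping first-seen order."""
    parts = (part.strip() for part in joined.split(sep))
    return list(dict.fromkeys(part for part in parts if part))


def clean_products(raw_json):
    """Return product dicts with images and variations as de-duplicated lists."""
    products = []
    for entry in json.loads(raw_json):
        product = dict(entry["product"])
        for field in ("images", "variations"):
            product[field] = split_unique(product.get(field, ""))
        products.append(product)
    return products


# Shortened stand-in for one record; the image names are placeholders.
sample = json.dumps([{"product": {
    "name": "Stronger Long Hoodie",
    "images": "a.jpg|b.jpg|c.jpg|d.jpg|b.jpg|c.jpg",
    "variations": "Light Grey Multi|Black Multi",
}}])

for product in clean_products(sample):
    print(product["name"], len(product["images"]), product["variations"])
```

Keeping first-seen order preserves the sequence in which the site lists images, which a plain `set` would not.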