feed.collector

Documentation for the eth_defi.feed.collector Python module.

Vault post collection and feed normalisation.

Functions

build_linkedin_rss_feed_urls(company_id, ...)

Build live feed URLs for a LinkedIn company id.

build_twitter_rss_feed_urls(handle, base_urls, *)

Build live feed URLs for a Twitter handle.

collect_posts(db, sources, *[, ...])

Collect posts for all configured sources and persist them in DuckDB.

collect_posts_for_source(source, *, ...[, ...])

Collect posts for one tracked source.

collect_twitter_list_posts(db, sources, *, ...)

Collect Twitter/X posts through a single X list timeline read.

fetch_feed_proxy_rotator()

Fetch an optional Webshare proxy rotator for feed fetching.

load_feed_proxy_rotator()

Backwards-compatible alias for fetch_feed_proxy_rotator().

Classes

CollectedSourceResult

Detailed collection result for one tracked source.

CollectorRunSummary

Summary counters for one collector run.

Exceptions

AllBridgesFailedError

Raised when every bridge URL for a social feed source fails.

exception AllBridgesFailedError

Bases: RuntimeError

Raised when every bridge URL for a social feed source fails.

Parameters
  • source_label – Human-readable source type label for error messages.

  • canonical_url – Canonical source URL for diagnostics.

  • bridge_errors – List of (url, http_status_or_none) tuples, one per attempt. A status of None indicates a non-HTTP failure such as a timeout.

__init__(source_label, canonical_url, bridge_errors)
__new__(**kwargs)
add_note(note, /)

Add a note to the exception.

property indicates_auth_block: bool

Return True when at least one bridge returned HTTP 503 (LinkedIn auth barrier).

When all bridges fail and at least one of them returns 503, LinkedIn is most likely redirecting unauthenticated requests to the login page for this company. A bridge that is simply down (502 or a connection error) says nothing about LinkedIn’s stance on the company page, so the remaining bridges are not required to return 503.
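The 503 check described above can be sketched as a standalone function. This is a minimal illustration, not the module's actual implementation: the helper name looks_like_auth_block, the BridgeError alias, and the example URLs are hypothetical. Only the (url, http_status_or_none) tuple shape comes from the parameter description.

```python
from typing import Optional

# Hypothetical alias for one bridge_errors entry: (url, http_status_or_none).
BridgeError = tuple[str, Optional[int]]


def looks_like_auth_block(bridge_errors: list[BridgeError]) -> bool:
    """Return True when at least one failed bridge attempt reported HTTP 503."""
    return any(status == 503 for _url, status in bridge_errors)


# One bridge is down (502), one timed out (None), one returned 503.
mixed_errors = [
    ("https://bridge-a.example/feed", 502),
    ("https://bridge-b.example/feed", None),
    ("https://bridge-c.example/feed", 503),
]
auth_blocked = looks_like_auth_block(mixed_errors)

# Bridges that are merely down carry no signal about an auth barrier.
down_only = looks_like_auth_block(mixed_errors[:2])
```

A single 503 is enough to flip the result, which matches the "at least one bridge" wording of the property.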

with_traceback(tb, /)

Set self.__traceback__ to tb and return self.

class CollectedSourceResult

Bases: object

Detailed collection result for one tracked source.

__init__(feeder_id, name, role, source_type, status, posts_fetched=0, posts_inserted=0, last_post_published_at=None, error=None, auth_blocked=False)
Return type

None

class CollectorRunSummary

Bases: object

Summary counters for one collector run.

__init__(sources_loaded=0, sources_succeeded=0, sources_failed=0, feeders_skipped=0, posts_fetched=0, posts_inserted=0, source_results=None, twitter_method=None, rss_duration_seconds=None, linkedin_duration_seconds=None, twitter_duration_seconds=None, total_duration_seconds=None)
Return type

None

build_linkedin_rss_feed_urls(company_id, url_templates)

Build live feed URLs for a LinkedIn company id.

Return type

list[str]

build_twitter_rss_feed_urls(handle, base_urls, *, url_templates=None)

Build live feed URLs for a Twitter handle.

Return type

list[str]

collect_posts(db, sources, *, max_posts_per_source=20, max_workers=8, request_timeout=20.0, request_delay_seconds=1.0, twitter_rss_base_urls=None, twitter_url_templates=None, linkedin_url_templates=None, proxy_rotator=None, max_proxy_rotations=3, twitter_bearer_token=None, twitter_user_cache=None, label='')

Collect posts for all configured sources and persist them in DuckDB.

Return type

eth_defi.feed.collector.CollectorRunSummary

collect_posts_for_source(source, *, max_posts_per_source, request_timeout, twitter_rss_base_urls, twitter_url_templates=None, linkedin_url_templates=None, proxy_rotator=None, max_proxy_rotations=3, twitter_bearer_token=None, twitter_user_cache=None)

Collect posts for one tracked source.

Return type

list[eth_defi.feed.database.CollectedPost]

collect_twitter_list_posts(db, sources, *, list_id, bearer_token, twitter_user_cache, max_tweets, fallback_max_tweets=5, label='Twitter list')

Collect Twitter/X posts through a single X list timeline read.

The list timeline API returns tweets across all list members in reverse chronological order. This lets production collection avoid one API call per tracked account while still storing posts under the account-specific tracked source rows.
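Storing tweets from one shared timeline under account-specific rows implies a grouping step keyed by author. The sketch below shows one way this could look, assuming each tweet carries an author user ID and that a handle-to-user-ID mapping is available. ListTweet and group_tweets_by_source are hypothetical names for illustration and are not part of this module.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ListTweet:
    """Hypothetical minimal tweet record; not the module's real type."""

    author_id: str
    text: str


def group_tweets_by_source(tweets, handle_to_user_id):
    """Bucket list timeline tweets under the tracked handle that wrote them."""
    # Invert the handle -> user ID mapping so authors can be looked up by ID.
    user_id_to_handle = {uid: handle for handle, uid in handle_to_user_id.items()}
    grouped = defaultdict(list)
    for tweet in tweets:
        handle = user_id_to_handle.get(tweet.author_id)
        if handle is not None:  # skip list members that are not tracked
            grouped[handle].append(tweet)
    return dict(grouped)


timeline = [
    ListTweet("1", "gm"),
    ListTweet("2", "vault update"),
    ListTweet("1", "new APY"),
    ListTweet("9", "from an untracked list member"),
]
grouped = group_tweets_by_source(timeline, {"alice": "1", "bob": "2", "carol": "3"})
```

Handles with no tweets in the timeline (carol in this example) simply do not appear in the result, which is the case the fallback logic has to handle.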

When a handle has no tweets in the list timeline and has no stored last_post_published_at (i.e. it is a brand-new handle whose first scan returned nothing), the collector falls back to a single individual timeline read. This seeds the timestamp and stores a small number of recent posts. Steady-state runs, where the list read stopped early because all recent tweets were already known, do not trigger per-account API calls.
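The fallback rule described above reduces to a two-part condition. The following is a minimal sketch of that condition only. The function name needs_individual_fallback and its arguments are hypothetical, and the real collector may track this state differently.

```python
from datetime import datetime, timezone
from typing import Optional


def needs_individual_fallback(
    tweets_in_list_timeline: int,
    last_post_published_at: Optional[datetime],
) -> bool:
    """Decide whether a handle should get a single individual timeline read.

    The fallback applies only when the list timeline held no tweets for the
    handle AND no publication timestamp has been stored for it yet.
    """
    return tweets_in_list_timeline == 0 and last_post_published_at is None


# Brand-new handle: nothing in the list timeline, nothing stored yet.
new_handle = needs_individual_fallback(0, None)

# Known handle on a steady-state run: a timestamp exists, so no extra call.
known_handle = needs_individual_fallback(0, datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Because a successful fallback stores a timestamp, the same handle would not qualify again on the next run under this condition.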

Parameters
  • db (eth_defi.feed.database.VaultPostDatabase) – Vault post database.

  • sources (Sequence[eth_defi.feed.sources.TrackedPostSource]) – Twitter tracked sources whose handles are represented in the X list.

  • list_id (str) – Numeric X list ID.

  • bearer_token (str) – X API bearer token used for list timeline reads.

  • twitter_user_cache (eth_defi.feed.twitter_api.TwitterUserCache) – Cache containing handle-to-user-ID mappings.

  • max_tweets (int) – Maximum tweets to read from the list timeline.

  • fallback_max_tweets (int) – Maximum tweets to fetch per account when the list timeline returns zero results for that handle. Used to populate last_post_published_at for inactive accounts. Defaults to 5.

  • label (str) – Dashboard label for this collection phase.

Returns

Collector run summary with per-source insert counters.

Return type

eth_defi.feed.collector.CollectorRunSummary

fetch_feed_proxy_rotator()

Fetch an optional Webshare proxy rotator for feed fetching.

Return type

Optional[eth_defi.event_reader.webshare.ProxyRotator]

load_feed_proxy_rotator()

Backwards-compatible alias for fetch_feed_proxy_rotator().

Return type

Optional[eth_defi.event_reader.webshare.ProxyRotator]